To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging
Does normalization help Part-of-Speech (POS) tagging accuracy on noisy,
non-canonical data? To the best of our knowledge, little is known on the actual
impact of normalization in a real-world scenario, where gold error detection is
not available. We investigate the effect of automatic normalization on POS
tagging of tweets. We also compare normalization to strategies that leverage
large amounts of unlabeled data kept in its raw form. Our results show that
normalization helps, but does not add consistently beyond just word embedding
layer initialization. The latter approach yields a tagging model that is
competitive with a Twitter state-of-the-art tagger.
Comment: In WNUT 201
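As a rough illustration of where normalization sits in such a pipeline, the sketch below rewrites noisy tokens before they would be passed to a tagger. It is not the normalization model evaluated in the paper: the lookup table `NORMALIZATION_LEXICON` and the function `normalize_tokens` are hypothetical stand-ins, and a plain dictionary lookup replaces the automatic normalizer.

```python
# Hypothetical lexicon mapping non-canonical tokens to canonical forms.
# A real system would learn or predict these replacements automatically.
NORMALIZATION_LEXICON = {
    "u": "you",
    "2morrow": "tomorrow",
    "gr8": "great",
}


def normalize_tokens(tokens, lexicon=NORMALIZATION_LEXICON):
    """Replace each token found in the lexicon; leave all others unchanged."""
    return [lexicon.get(token.lower(), token) for token in tokens]


# The normalized tokens would then be fed to any POS tagger in place of the raw ones.
raw_tweet = ["c", "U", "2morrow"]
normalized_tweet = normalize_tokens(raw_tweet)
```

The alternative the abstract compares against leaves the input raw and instead initializes the tagger's word embeddings from large amounts of unlabeled data.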
Učinci habsburških obrazovnih politika mjereni statističkim alatima popisa stanovništva (The Effects of Habsburg Educational Policies as Measured by the Statistical Tools of the Census)
This paper examines the multi-ethnic and multilingual Habsburg realm, in particular school education, its effects, and the census registration of linguistic characteristics among its population. After almost a century of German-language dominance, the national revival of the Habsburg peoples forced school education to abandon the formation of a supra-national and linguistically uniform leadership. Secondary and higher education gradually turned to cultivating new nationally conscious elites among the various peoples, contributing to the decomposition of the realm.
Nevertheless, promotion of the ‘national languages’ resulted in widespread bilingualism, at least among the middle and higher classes. This bilingualism, however, was restricted to the nationalities and did not extend to Austro-Germans and Magyars, who, in their own secondary educational institutions, kept to a virtually unilingual practice, a fact that, in the end, weakened their political influence. This inequality has to be taken into consideration when different school types are compared.
One of the most common ways to investigate developments in the linguistic capacity of the Habsburg subjects is through the decennial censuses, but these rely on rigid and dichotomous concepts that merely describe ethno-lingual identities, which are nonetheless aphoristically equated with political ‘nations’. This calls for clearer definitions, and this paper advocates a critical reconsideration of national and linguistic concepts and definitions as habitually used in Habsburg historiography. An exposé of different educational practices in both parts of the realm, Austria and Hungary, may serve as context to this appeal.
On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions
Recent complementary strands of research have shown that leveraging
information on the data source through encoding their properties into
embeddings can lead to performance increase when training a single model on
heterogeneous data sources. However, it remains unclear in which situations
these dataset embeddings are most effective, because they are used in a large
variety of settings, languages and tasks. Furthermore, it is usually assumed
that gold information on the data source is available, and that the test data
is from a distribution seen during training. In this work, we compare the
effect of dataset embeddings in mono-lingual settings, multi-lingual settings,
and with predicted data source label in a zero-shot setting. We evaluate on
three morphosyntactic tasks: morphological tagging, lemmatization, and
dependency parsing, and use 104 datasets, 66 languages, and two different
dataset grouping strategies. Performance increases are highest when the
datasets are of the same language, and we know from which distribution the
test-instance is drawn. In contrast, for setups where the data is from an
unseen distribution, the performance increase vanishes.
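The general idea of a dataset embedding can be sketched as a per-dataset vector appended to every token representation of a sentence. The sketch below is a minimal illustration under stated assumptions, not the architecture used in the paper: `build_table` and `embed_sentence` are hypothetical helpers, the vectors are random rather than trained, the dataset label `en_ewt` is only an example, and an unseen dataset falls back to a zero vector (one possible choice among several).

```python
import random


def build_table(keys, dim, seed=0):
    """Create a toy embedding table with one random vector per key."""
    rng = random.Random(seed)
    return {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for key in keys}


def embed_sentence(tokens, dataset_id, word_table, dataset_table,
                   word_dim, dataset_dim):
    """Concatenate each word vector with the vector of the sentence's dataset.

    Unknown words and unknown datasets both fall back to zero vectors
    (an assumption made for this sketch).
    """
    dataset_vec = dataset_table.get(dataset_id, [0.0] * dataset_dim)
    return [
        word_table.get(token, [0.0] * word_dim) + dataset_vec
        for token in tokens
    ]


word_table = build_table(["the", "cat"], dim=4)
dataset_table = build_table(["en_ewt"], dim=2, seed=1)
vectors = embed_sentence(["the", "cat"], "en_ewt",
                         word_table, dataset_table, word_dim=4, dataset_dim=2)
```

Every token in a sentence shares the same dataset suffix, which is what lets a single model condition on the data source. The zero-shot setting in the abstract corresponds to the case where that source label has to be predicted or is unknown.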
Entity Linking in the Job Market Domain
In Natural Language Processing, entity linking (EL) has centered around
Wikipedia, yet remains underexplored for the job market domain.
Disambiguating skill mentions can help us get insight into the current labor
market demands. In this work, we are the first to explore EL in this domain,
specifically targeting the linkage of occupational skills to the ESCO taxonomy
(le Vrang et al., 2014). Previous efforts linked coarse-grained (full)
sentences to a corresponding ESCO skill. In this work, we link more
fine-grained span-level mentions of skills. We tune two high-performing neural
EL models, a bi-encoder (BLINK; Wu et al., 2020) and an autoregressive model
(GENRE; Cao et al., 2021), on a synthetically generated mention-skill pair dataset and
evaluate them on a human-annotated skill-linking benchmark. Our findings reveal
that both models are capable of linking implicit mentions of skills to their
correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict
evaluation, but GENRE performs better in loose evaluation (accuracy@).
Comment: Accepted at EACL 2024 Finding
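The difference between strict and loose evaluation can be illustrated with an accuracy-at-k computation over ranked candidate lists. This is a sketch under assumptions: the abstract does not state the cutoff, so strict evaluation is taken here to mean k = 1 and loose evaluation any larger k, and the function `accuracy_at_k` and the skill labels are hypothetical.

```python
def accuracy_at_k(ranked_predictions, gold_labels, k):
    """Fraction of mentions whose gold label appears in the top-k candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("predictions and gold labels must align one-to-one")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Illustrative ranked candidates for two skill mentions (made-up labels).
ranked = [
    ["python programming", "java"],
    ["teamwork", "leadership"],
]
gold = ["python programming", "leadership"]

strict_score = accuracy_at_k(ranked, gold, k=1)  # only the top candidate counts
loose_score = accuracy_at_k(ranked, gold, k=2)   # gold may appear anywhere in the top 2
```

A model that ranks the correct skill second gains nothing under the strict setting but is credited under the loose one, which is how two systems can swap places between the two metrics.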
How Universal is Genre in Universal Dependencies?
This work provides the first in-depth analysis of genre in Universal
Dependencies (UD). In contrast to prior work on genre identification which uses
small sets of well-defined labels in mono-/bilingual setups, UD contains 18
genres with varying degrees of specificity spread across 114 languages. As most
treebanks are labeled with multiple genres while lacking annotations about
which instances belong to which genre, we propose four methods for predicting
instance-level genre using weak supervision from treebank metadata. The
proposed methods recover instance-level genre better than competitive baselines
as measured on a subset of UD with labeled instances and adhere better to the
global expected distribution. Our analysis sheds light on prior work using UD
genre metadata for treebank selection, finding that metadata alone are a noisy
signal and must be disentangled within treebanks before they can be universally
applied.
Comment: Accepted at SyntaxFest 202