763 research outputs found

    To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

    Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known on the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger. Comment: In WNUT 201

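    Učinci habsburških obrazovnih politika mjereni statističkim alatima popisa stanovništva (Effects of Habsburg Educational Policies Measured with the Statistical Tools of the Population Census)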

    This paper is dedicated to the multi-ethnic and multi-lingual Habsburg realm, in particular as regards school education, its effects and the census registration of linguistic qualities among its population. After almost a century of German language dominance, the national revival of the Habsburg peoples forced school education to renounce the upbringing of a supra-national and linguistically uniform leadership. Secondary and higher education gradually chose to breed new nationally conscious elites among the various peoples, contributing to the decomposition of the realm. Nevertheless, promotion of the ‘national languages’ resulted in widespread bilingualism, at least among the middle and higher classes. This bilingualism, however, was restricted to the nationalities and did not extend to Austro-Germans and Magyars, who, in their own secondary educational institutions, stuck to a virtually unilingual practice, a fact that, in the end, weakened their political influence. This inequality has to be taken into consideration when different school types are compared. One of the most usual ways to investigate developments in the linguistic capacity of the Habsburg subjects is found in the decennial censuses, but these are presented with rigid and dichotomous concepts that merely describe ethno-lingual identities, which are, however, aphoristically equated with political ‘nations’. This calls for clearer definitions, and this paper advocates a critical reconsideration of national and linguistic concepts and definitions, as habitually used in Habsburg historiography. An exposé of different educational practices in both parts of the realm, Austria and Hungary, may serve as context to this appeal.

    On the Effectiveness of Dataset Embeddings in Mono-lingual, Multi-lingual and Zero-shot Conditions

    Recent complementary strands of research have shown that leveraging information on the data source through encoding their properties into embeddings can lead to performance increases when training a single model on heterogeneous data sources. However, it remains unclear in which situations these dataset embeddings are most effective, because they are used in a large variety of settings, languages and tasks. Furthermore, it is usually assumed that gold information on the data source is available, and that the test data is from a distribution seen during training. In this work, we compare the effect of dataset embeddings in mono-lingual settings, multi-lingual settings, and with predicted data source labels in a zero-shot setting. We evaluate on three morphosyntactic tasks: morphological tagging, lemmatization, and dependency parsing, and use 104 datasets, 66 languages, and two different dataset grouping strategies. Performance increases are highest when the datasets are of the same language, and we know from which distribution the test instance is drawn. In contrast, for setups where the data is from an unseen distribution, the performance increase vanishes.

    How Universal is Genre in Universal Dependencies?

    This work provides the first in-depth analysis of genre in Universal Dependencies (UD). In contrast to prior work on genre identification which uses small sets of well-defined labels in mono-/bilingual setups, UD contains 18 genres with varying degrees of specificity spread across 114 languages. As most treebanks are labeled with multiple genres while lacking annotations about which instances belong to which genre, we propose four methods for predicting instance-level genre using weak supervision from treebank metadata. The proposed methods recover instance-level genre better than competitive baselines as measured on a subset of UD with labeled instances and adhere better to the global expected distribution. Our analysis sheds light on prior work using UD genre metadata for treebank selection, finding that metadata alone are a noisy signal and must be disentangled within treebanks before it can be universally applied. Comment: Accepted at SyntaxFest 202

    DAN+: Danish Nested Named Entities and Lexical Normalization

    Spectral Probing
