9 research outputs found

    SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

    This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

    UniMorph 4.0: Universal Morphology

    The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

    Entanglements of digital technologies and Indigenous language work in the Northern Territory

    This thesis addresses the question of what happens when digital language resources are developed and become entangled with different types of language work in Indigenous languages of Australia's Northern Territory. It explores three specific sociotechnical assemblages, defined as heterogeneous sets of social and technical resources functioning together for various purposes. The types of language work that emerged concern the role of language in practices of documentation, pedagogy and identity-making. The three projects under consideration respond to different motivations. The Living Archive of Aboriginal Languages is a digital archive of endangered literature in languages of the Northern Territory, motivated by a concern for the fate of materials produced in bilingual education programs in remote schools. The Digital Language Shell is a resource for developing and mobilising curricula in Indigenous languages and cultures, motivated by a need for a low-cost and low-tech template for sharing content under Indigenous authority. The Bininj Kunwok online course is a specific implementation of the Digital Language Shell, teaching an Indigenous language of West Arnhem Land in a university context. Each project was created by the author working collaboratively with different teams, to support various types of language work. This PhD by publication offers a set of seven academic papers, each focusing on different aspects of the projects, and written for distinct audiences. The methods entailed iterative inquiry, as I reflected on my work as project manager in developing these digital resources, first addressing the technical and practical considerations, then through the lenses of various academic disciplines, and finally in a meta-analysis of the various heterogeneous elements that make up the research. The thesis emerges as an assemblage of heterogeneities – projects, papers, concepts, academic references, and auto-ethnographic stories – that is in itself a sociotechnical assemblage.

    Polysynthetic sociolinguistics: the language and culture of Murrinh Patha Youth

    This thesis is about the life and language of kardu kigay – young Aboriginal men in the town of Wadeye, northern Australia. Kigay have attained some notoriety within Australia for their participation in “heavy metal gangs”, which periodically cause havoc in the town. But within Australianist linguistics circles, they are additionally known for speaking Murrinh Patha, a polysynthetic language that has a number of unique grammatical structures, and which is one of the few Aboriginal languages still being learnt by children. My core interest is to understand how people’s lives shape their language, and how their language shapes their lives. In this thesis these interests are focused around the following research goals: (1) To document the social structures of kigay’s day-to-day lives, including the subcultural “metal gang” dimension of their sociality; (2) To document the language that kigay speak, focusing in particular on aspects of their speech that differ from what has been documented in previous descriptions of Murrinh Patha; (3) To analyse which features of kigay speech might be socially salient linguistic markers, and which are more likely to reflect processes of grammatical change that run below the level of social or cognitive salience; (4) To analyse how kigay speech compares to other youth Aboriginal language varieties documented in northern Australia, and argue that together these can be described as a phenomenon of linguistic urbanisation. I will show that the “heavy metal gangs” are an idiosyncratic local subculture that uses foreign heavy metal bands as group totems. Social connections and loyalties are formed on the basis of peer solidarity, as opposed to the traditional totemic system, which is structured around ancestry. Lives are now shaped by the dense (and often conflict-riven) town environment, as opposed to bush life, which was inseparable from the land.
    Kigay’s in-group language is a “slang” variety of Murrinh Patha (MP), which deploys new words and phrases by borrowing and reinterpreting English vocabulary. It is also characterised by substantial lenitions and deletions in the pronunciation. The MP grammatical system still underlies this speech, but some of its more complex morphosyntactic forms are restricted to the “heavy” speech of older people, and there are various mergers and reconfigurations occurring in the verb morphology. This thesis adds to the growing body of work describing how language contact and changing sociolinguistic dynamics are radically restructuring the linguistic repertoire of Aboriginal communities in northern and central Australia. At the same time, it is one of very few studies providing sociolinguistic description of a polysynthetic language, and is therefore an innovative study in polysynthetic sociolinguistics.