SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas. Peer reviewed.
UniMorph 4.0: Universal Morphology
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
Entanglements of digital technologies and Indigenous language work in the Northern Territory
This thesis addresses the question of what happens when digital language resources are developed and become entangled with different types of language work in Indigenous languages of Australia's Northern Territory. It explores three specific sociotechnical assemblages, defined as heterogeneous sets of social and technical resources functioning together for various purposes. The types of language work that emerged were the role of language in practices of documentation, pedagogy and identity-making. The three projects under consideration respond to different motivations: the Living Archive of Aboriginal Languages is a digital archive of endangered literature in languages of the Northern Territory, motivated by a concern for the fate of materials produced in bilingual education programs in remote schools. The Digital Language Shell is a resource for developing and mobilising curricula in Indigenous languages and cultures, motivated by a need for a low-cost and low-tech template for sharing content under Indigenous authority. The Bininj Kunwok online course is a specific implementation of the Digital Language Shell, teaching an Indigenous language of West Arnhem Land in a university context. Each project was created by the author working collaboratively with different teams, to support various types of language work. This PhD by publication offers a set of seven academic papers, each focusing on different aspects of the projects, and written for distinct audiences. The methods entailed iterative inquiry, as I reflected on my work as project manager in developing these digital resources, first addressing the technical and practical considerations, then through the lenses of various academic disciplines, and finally in a meta-analysis of the various heterogeneous elements that make up the research.
The thesis emerges as an assemblage of heterogeneities (projects, papers, concepts, academic references, and auto-ethnographic stories) that is in itself a sociotechnical assemblage.
Polysynthetic sociolinguistics: the language and culture of Murrinh Patha Youth
This thesis is about the life and language of kardu kigay, young Aboriginal men in the town of Wadeye, northern Australia. Kigay have attained some notoriety within Australia for their participation in 'heavy metal gangs', which periodically cause havoc in the town. But within Australianist linguistics circles, they are additionally known for speaking Murrinh Patha, a polysynthetic language that has a number of unique grammatical structures, and which is one of the few Aboriginal languages still being learnt by children. My core interest is to understand how people's lives shape their language, and how their language shapes their lives. In this thesis these interests are focused around the following research goals:
(1) To document the social structures of kigay's day-to-day lives, including the subcultural 'metal gang' dimension of their sociality;
(2) To document the language that kigay speak, focusing in particular on aspects of their speech that differ from what has been documented in previous descriptions of Murrinh Patha;
(3) To analyse which features of kigay speech might be socially salient linguistic markers, and which are more likely to reflect processes of grammatical change that run below the level of social or cognitive salience;
(4) To analyse how kigay speech compares to other youth Aboriginal language varieties documented in northern Australia, and argue that together these can be described as a phenomenon of linguistic urbanisation.
I will show that the 'heavy metal gangs' are an idiosyncratic local subculture that uses foreign heavy metal bands as group totems. Social connections and loyalties are formed on the basis of peer solidarity, as opposed to the traditional totemic system, which is structured around ancestry. Lives are now shaped by the dense (and often conflict-riven) town environment, as opposed to bush life, which was inseparable from the land. Kigay's in-group language is a 'slang' variety of Murrinh Patha (MP), which deploys new words and phrases by borrowing and reinterpreting English vocabulary. It is also characterised by substantial lenitions and deletions in the pronunciation. The MP grammatical system still underlies this speech, but some of its more complex morphosyntactic forms are restricted to the 'heavy' speech of older people, and there are various mergers and reconfigurations occurring in the verb morphology.
This thesis adds to the growing body of work describing how language contact and changing sociolinguistic dynamics are radically restructuring the linguistic repertoire of Aboriginal communities in northern and central Australia. At the same time, it is one of very few studies providing sociolinguistic description of a polysynthetic language, and is therefore an innovative study in polysynthetic sociolinguistics.