20 research outputs found
Automatic morphological analysis and interlinking of historical Irish cognate verb forms
The main aim of the author’s research project is to use computational approaches
to gain more insight into the historical development of Irish verbs. One of the objectives is to investigate how a link between the electronic Dictionary of the Irish
Language (eDIL), covering the period c. 700–c. 1700, but focussing on Early Irish
(7th–12th centuries), and the nascent Foclóir Stairiúil na Gaeilge ‘The Historical
Dictionary of Irish’, covering the period 1600–2000, could be implemented. Such
a link will be hugely beneficial for scholars operating at the intersection of the medieval and modern periods (see Table 1), who currently lack a comprehensive lexical resource for the ‘intermediate’ early modern period.
This paper stems from research carried out during a
Government of Ireland Postgraduate Scholarship (GOIPG/2017/1808) funded by
the Irish Research Council. The author would also like to acknowledge the
anonymous reviewer for helpful feedback and the editors for seeing this publication through.
Peer reviewed
Cardamom: Comparative deep models for minority and historical languages
This paper gives an overview of the Cardamom project, which aims to close the resource gap for minority and under-resourced languages
by means of deep-learning-based natural language processing (NLP) and exploiting similarities of closely-related languages. The project
further extends this idea to historical languages, which can be considered as closely related to their modern form, and as such aims to
provide NLP through both space and time for languages that have been ignored by current approaches.
Developing an inflectional lexicon for Old Irish
While Old Irish (c. 600–900 A.D.) is extensively documented, it remains digitally
under-resourced, lacking the range of digital resources available for other older Indo-European
languages (e.g., Latin, see Pellegrini and Passarotti, 2018). We report on the development
of a fully inflected lexicon of Old Irish nouns, provided in both phonemic and orthographic
notation. This involved a computer-assisted, systematic, and reproducible
grapheme-to-phoneme conversion pipeline and generating morphological forms through a finite-state
transducer. The inflected lexicon we develop will better enable computational studies in
Old Irish morphology, support further research into diachronic developments, and have a wide
range of Natural Language Processing (NLP) applications.
We began by extracting noun lemmata from the Old Irish Würzburg glosses (Kavanagh,
2001) and the Corpus PalaeoHibernicum (CorPH) ‘Old Irish Corpus’ (Stifter et al., 2021). We
then devised a set of rules for orthography-to-phonology conversion, subsequently
implemented using the Python package Epitran (Mortensen, Dalmia, and Littell, 2018). The
resulting transcriptions act as the data input for a finite-state transducer (FST) adapted
from Fransen (2019), allowing us to generate inflected forms of Old Irish nouns. Finally,
we derived orthographic forms (and their variants) by applying conversion rules to the
generated forms.
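The orthography-to-phonology step can be pictured with a minimal sketch of ordered rewrite rules. This is an illustration only: it does not use Epitran, and the rule list and the function name to_phonemic are hypothetical, heavily simplified placeholders rather than the rule set devised in the paper.

```python
# Toy ordered rewrite rules (grapheme sequence -> phonemic symbol).
# ASSUMPTION: a simplified placeholder list, not the rules from the paper.
# Digraphs come first so that "ch" is matched before the single letter "c".
RULES = [
    ("ch", "x"),
    ("th", "θ"),
    ("ph", "f"),
    ("c", "k"),
]


def to_phonemic(word, rules=RULES):
    """Scan left to right, applying the first rule that matches at each position.

    Characters not covered by any rule are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(word):
        for graph, phon in rules:
            if word.startswith(graph, i):
                out.append(phon)
                i += len(graph)
                break
        else:
            # No rule matched at this position: keep the character as it is.
            out.append(word[i])
            i += 1
    return "".join(out)


print(to_phonemic("cath"))  # kaθ
print(to_phonemic("ech"))   # ex
```

A real rule set for Old Irish would also need context-sensitive rules, since the same letter can stand for different sounds depending on its position in the word; the sketch only shows the rule-ordering idea.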
Old Irish presents considerable challenges for the development of a resource of this
nature, given its opaque and inconsistent orthography, complex phonology, elaborate
system of morphophonological alternations, and intricate patterns of morphological
inflection (Anderson, 2016; Stifter, 2009; Thurneysen, 1946; Pedersen, 1909–1913). We
report on how we dealt with these problems in the development of the inflectional
lexicon. While this study focused on the Old Irish nouns in the Würzburg glosses, we
intend to extend the lexicon by applying this pipeline to further corpora and other
parts of speech. This inflected lexicon makes possible systematic studies in data-driven
morphology and typology (Pellegrini, 2020; Beniamine, Bonami, and Luís, 2021;
Beniamine, 2021), and facilitates future research into diachronic and diatopic variation in
Irish and the development of further NLP applications for the language.
References
Anderson, Cormac (2016). ‘Consonant colour and vocalism in the history of Irish’. PhD
thesis. Uniwersytet im. Adama Mickiewicza w Poznaniu. URL:
https://hdl.handle.net/10593/14780.
Beniamine, Sacha (2021). ‘One lexeme, many classes: inflection class systems as lattices’.
In: One-to-Many Relations. Ed. by Berthold Crysmann and Manfred Sailer. Berlin:
Language Science Press.
Beniamine, Sacha, Olivier Bonami, and Ana R. Luís (2021). ‘The fine implicative structure
of European Portuguese conjugation’. In: Isogloss 7.9, pp. 1–35. DOI:
https://doi.org/10.5565/rev/isogloss.109.
Fransen, Theodorus (2019). ‘Past, present and future: Computational approaches to
mapping historical Irish cognate verb forms’. PhD thesis. Trinity College Dublin,
The University of Dublin. URL: https://github.com/ThFransen84/OIfst.
Kavanagh, Séamus (2001). A Lexicon of the Old Irish Glosses in the Würzburg Manuscript of
the Epistles of St. Paul. Ed. by Dagmar S. Wodtko. Mitteilungen der Prähistorischen
Kommission 45. + 1 CD-ROM. Wien: Verlag der Österreichischen Akademie der
Wissenschaften. DOI: 10.1553/0x0001fb6e.
Mortensen, David R., Siddharth Dalmia, and Patrick Littell (May 2018). ‘Epitran: Precision
G2P for Many Languages’. In: Proceedings of the Eleventh International Conference
on Language Resources and Evaluation (LREC 2018). Ed. by Nicoletta Calzolari
(Conference chair) et al. Miyazaki, Japan: European Language Resources
Association (ELRA).
Pedersen, Holger (1909–1913). Vergleichende Grammatik der keltischen Sprachen. 2 Vols.
Göttingen: Vandenhoeck & Ruprecht.
Pellegrini, Matteo (2020). ‘Using LatInfLexi for an Entropy-Based Assessment of
Predictability in Latin Inflection’. In: Proceedings of LT4HALA 2020 - 1st
Workshop on Language Technologies for Historical and Ancient Languages. Marseille,
France: European Language Resources Association (ELRA), pp. 37–46. URL:
https://aclanthology.org/2020.lt4hala-1.6.
Pellegrini, Matteo and Marco Passarotti (2018). ‘LatInfLexi: an Inflected Lexicon of Latin
Verbs’. In: Proceedings of the Fifth Italian Conference on Computational Linguistics
(CLiC-it 2018) (Turin, Italy, Dec. 10, 2018). Ed. by Elena Cabrio, Alessandro Mazzei,
and Fabio Tamburini. Vol. 2253. CEUR Workshop Proceedings. Aachen. URL:
http://ceur-ws.org/Vol-2253/paper23.pdf.
Stifter, David (2009). ‘Early Irish’. In: The Celtic Languages. Ed. by Martin Ball and Nicole
Müller. Hoboken: Routledge.
Stifter, David et al. (2021). Corpus PalaeoHibernicum (CorPH) v1.0. URL:
http://chronhib.maynoothuniversity.ie.
Thurneysen, Rudolf (1946). A Grammar of Old Irish. Trans. by Daniel A. Binchy and Osborn
Bergin. Revised and enlarged edition. Dublin: Dublin Institute for Advanced
Studies. Repr. 1993, with supplement.
Automatic morphological parsing of Old Irish verbs using finite-state transducers
The topic of this paper constitutes the main part of a recently finished Ph.D. project carried out by the author, which investigates how computational methods can be employed to map cognate verb forms in Early Irish (ca. 7th–12th centuries A.D.) and Modern Irish (ca. 1200 onwards). This paper discusses the development of a finite-state morphological transducer using foma (Hulden, 2009) for the Old Irish language (ca. 7th–9th centuries A.D.), focusing on verbs. Two main challenges are discussed. First, different practices of word segmentation have significant repercussions for the encoding of dependencies both on and beyond the word level. A second challenge is complex verb stem formation and considerable stem allomorphy. This has been tackled by operating with ‘monolithic stem’ entries for each verb lemma, i.e., synchronic, invariable hard-coded stems, representing a semi-surface-level base form.
Towards a computational lexical resource for the diachronic study of Irish verbs
In this paper, we propose a computational framework for a lexical resource that will better facilitate
diachronic study of Irish verbs. The verbal system is subject to major morphological changes between
Early Irish (c. 7th-12th centuries A.D.) and Modern Irish varieties (post-12th centuries) (McCone
1997). Moreover, whereas the literary output in the Old Irish period (c. 8th-9th centuries A.D.) points
to a standardised language (Stifter 2009), all post-Old Irish historical varieties, except for bardic
poetry (Early Modern Irish period, c. 13th-17th centuries A.D.), show a substantial degree of
grammatical, orthographical and, particularly evident in the case of Early Modern Irish prose (Ó
hUiginn 2013), stylistic variation (cf. contributions in McCone 1994). The available digital support
is insufficient to systematically trace the linguistic change and variation.
The research described here aims to mitigate the lack of digital support by creating and
linking verb forms in morphologically annotated corpora by using a morphological analyser for
contemporary, standardised Irish, which is already in the process of being adapted for successively earlier
Modern Irish texts (Uí Dhonnchadha et al. 2014), and by developing new tagging tools for Old Irish,
to project forward to later forms.
This paper will focus on the creation of a morphological analyser for Old Irish using
finite-state morphology (Beesley and Karttunen 2003). Recognition rates for an Early Irish sample text and
associated findings and challenges will be reported on. The paper concludes with an outlook on the
implementation stage of the lexical resource, its benefits and potential further research. We will (a)
discuss challenges in morphologically tagging and accurately linking verbal cognates across historical
corpora, (b) explore the ways in which this resource can serve and advance (digital) scholarship in
historical Irish philology and linguistics, and (c) address more general questions relating to the
balance between computational methods and manual work in successfully linking cognate verb forms.
Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-Resource Languages
We present the findings of the LoResMT 2021 shared task, which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The task was organized as part of the fourth workshop on technologies for machine translation of low-resource languages (LoResMT). Parallel corpora are presented and made publicly available for the following directions: English–Irish, English–Marathi, and Taiwanese Sign Language–Traditional Chinese. Training data consists of 8112, 20933 and 128608 segments, respectively. There are additional monolingual data sets for Marathi and English that consist of 21901 segments. The results presented here are based on entries from a total of eight teams. Three teams submitted systems for English–Irish while five teams submitted systems for English–Marathi. Unfortunately, there were no system submissions for the Taiwanese Sign Language–Traditional Chinese task. System performance was measured using BLEU; the maximum scores were 36.0 for English–Irish, 34.6 for Irish–English, 24.2 for English–Marathi, and 31.3 for Marathi–English.
How Computers Can Future-Proof Minority Languages
Dr. Theodorus Fransen & Dr. John McCrae explore how digital language tools could help resolve the underrepresentation of minority languages in digital technology and on the Web.