Automatic morphological analysis and interlinking of historical Irish cognate verb forms

Author: Fransen Theodorus
Publication venue: De Gruyter Mouton
Publication date: 27/10/2020

The main aim of the author’s research project is to use computational approaches to gain more insight into the historical development of Irish verbs. One of the objectives is to investigate how a link could be implemented between the electronic Dictionary of the Irish Language (eDIL), covering the period c. 700–c. 1700 but focussing on Early Irish (7th–12th centuries), and the nascent Foclóir Stairiúil na Gaeilge ‘The Historical Dictionary of Irish’, covering the period 1600–2000. Such a link will be hugely beneficial for scholars operating at the intersection of the medieval and modern period (see Table 1), who currently lack a comprehensive lexical resource for the “intermediate” early modern period. This paper stems from research carried out during a Government of Ireland Postgraduate Scholarship (GOIPG/2017/1808) funded by the Irish Research Council. The author would also like to acknowledge the anonymous reviewer for helpful feedback and the editors for seeing this publication through. Peer reviewed.

Irish Universities

Automatic morphological analysis and interlinking of historical Irish cognate verb forms

Author: Fransen Theodorus
Publication venue: Walter de Gruyter GmbH
Publication date: 01/01/2020

PubliCatt

Irish Universities

Access to Research at National University of Ireland, Galway

Cardamom: Comparative deep models for minority and historical languages

Authors: Fransen Theodorus, McCrae John Philip
Publication venue: Language Technologies for All (LT4All)
Publication date: 29/07/2020

This paper gives an overview of the Cardamom project, which aims to close the resource gap for minority and under-resourced languages by means of deep-learning-based natural language processing (NLP) and by exploiting similarities between closely related languages. The project further extends this idea to historical languages, which can be considered closely related to their modern forms, and as such aims to provide NLP through both space and time for languages that have been ignored by current approaches.

Irish Universities

Developing an inflectional lexicon for Old Irish

Author: Fransen Theodorus (ORCID:0000-0001-5639-8626)
Publication venue: place:Utrecht
Publication date: 01/01/2023

While Old Irish (c. 600–900 A.D.) is extensively documented, it remains digitally under-resourced, lacking the range of digital resources available for other older Indo-European languages (e.g., Latin, see Pellegrini and Passarotti, 2018). We report on the development of a fully inflected lexicon of Old Irish nouns, provided in both phonemic and orthographic notation. This involved a computer-assisted, systematic, and reproducible grapheme-to-phoneme conversion pipeline and generating morphological forms through a finite-state transducer. The inflected lexicon we develop will better enable computational studies in Old Irish morphology, further research into diachronic developments, and have a wide range of Natural Language Processing (NLP) applications. We began by extracting noun lemmata from the Old Irish Würzburg glosses (Kavanagh, 2001) and the Corpus PalaeoHibernicum (CorPH) ‘Old Irish Corpus’ (Stifter et al., 2021). We then devised a set of rules for orthography-to-phonology conversion, subsequently implemented using the Python package Epitran (Mortensen, Dalmia, and Littell, 2018). The resulting transcriptions act as the data input for a finite-state transducer (FST) adapted from Fransen (2019), allowing us to generate inflected forms of Old Irish nouns. Finally, we derived orthographic forms (and their variants) by applying conversion rules to the generated forms. Old Irish presents considerable challenges for the development of a resource of this nature, given its opaque and inconsistent orthography, complex phonology, elaborate system of morphophonological alternations, and intricate patterns of morphological inflection (Anderson, 2016; Stifter, 2009; Thurneysen, 1946; Pedersen, 1909–1913). We report on how we dealt with these problems in the development of the inflectional lexicon.
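The three stages described above (orthography to phonemes, form generation, phonemes back to orthography) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rewrite rules, the paradigm cells `cell_1` and `cell_2`, the ending `a`, and the function names are hypothetical placeholders, not the authors' rules. It uses plain regular expressions and string concatenation in place of Epitran and a finite-state transducer.

```python
import re

# Hypothetical, heavily simplified orthography-to-phoneme rewrite rules.
# Order matters: digraphs must be rewritten before single letters.
G2P_RULES = [(r"ch", "x"), (r"th", "θ"), (r"c", "k")]

# Hypothetical inverse rules used to derive spellings from generated forms.
P2G_RULES = [("x", "ch"), ("θ", "th"), ("k", "c")]

# Placeholder paradigm: cell labels and endings are illustrative only
# and make no claim about Old Irish inflection.
TOY_ENDINGS = {"cell_1": "", "cell_2": "a"}


def to_phonemes(word):
    """Apply the ordered rewrite rules to an orthographic form."""
    out = word.lower()
    for pattern, replacement in G2P_RULES:
        out = re.sub(pattern, replacement, out)
    return out


def generate_forms(stem):
    """Concatenate the phonemic stem with each toy ending."""
    return {cell: stem + ending for cell, ending in TOY_ENDINGS.items()}


def to_orthography(phonemes):
    """Map a phonemic form back to a spelling with the inverse rules."""
    out = phonemes
    for source, target in P2G_RULES:
        out = out.replace(source, target)
    return out


def build_lexicon(lemmata):
    """Return {lemma: {cell: (phonemic form, orthographic form)}}."""
    lexicon = {}
    for lemma in lemmata:
        forms = generate_forms(to_phonemes(lemma))
        lexicon[lemma] = {
            cell: (form, to_orthography(form)) for cell, form in forms.items()
        }
    return lexicon


lexicon = build_lexicon(["cath"])
```

A real pipeline would need context-sensitive rules and morphophonological alternations, which is where a transducer becomes preferable to plain concatenation.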
While this study focused on the Old Irish nouns in the Würzburg glosses, we intend to extend the lexicon by applying this pipeline to further corpora and other parts of speech. This inflected lexicon makes possible systematic studies in data-driven morphology and typology (Pellegrini, 2020; Beniamine, Bonami, and Luís, 2021; Beniamine, 2021), and facilitates future research into diachronic and diatopic variation in Irish and the development of further NLP applications for the language.

References
Anderson, Cormac (2016). “Consonant colour and vocalism in the history of Irish”. PhD thesis. Uniwersytet im. Adama Mickiewicza w Poznaniu. URL: https://hdl.handle.net/10593/14780.
Beniamine, Sacha (2021). “One lexeme, many classes: inflection class systems as lattices”. In: One-to-Many Relations. Ed. by Berthold Crysmann and Manfred Sailer. Berlin: Language Science Press.
Beniamine, Sacha, Olivier Bonami, and Ana R. Luís (2021). “The fine implicative structure of European Portuguese conjugation”. In: Isogloss 7.9, pp. 1–35. DOI: https://doi.org/10.5565/rev/isogloss.109.
Fransen, Theodorus (2019). “Past, present and future: Computational approaches to mapping historical Irish cognate verb forms”. PhD thesis. Trinity College Dublin, The University of Dublin. URL: https://github.com/ThFransen84/OIfst.
Kavanagh, Séamus (2001). A Lexicon of the Old Irish Glosses in the Würzburg Manuscript of the Epistles of St. Paul. Ed. by Dagmar S. Wodtko. Mitteilungen der Prähistorischen Kommission 45. + 1 CD-ROM. Wien: Verlag der Österreichischen Akademie der Wissenschaften. DOI: 10.1553/0x0001fb6e.
Mortensen, David R., Siddharth Dalmia, and Patrick Littell (May 2018). “Epitran: Precision G2P for Many Languages”. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Ed. by Nicoletta Calzolari (Conference chair) et al. Miyazaki, Japan: European Language Resources Association (ELRA).
Pedersen, Holger (1909–1913). Vergleichende Grammatik der keltischen Sprachen. 2 Vols. Göttingen: Vandenhoeck & Ruprecht.
Pellegrini, Matteo (2020). “Using LatInfLexi for an Entropy-Based Assessment of Predictability in Latin Inflection”. English. In: Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages. Marseille, France: European Language Resources Association (ELRA), pp. 37–46. URL: https://aclanthology.org/2020.lt4hala-1.6.
Pellegrini, Matteo and Marco Passarotti (2018). “LatInfLexi: an Inflected Lexicon of Latin Verbs”. In: Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018) (Turin, Italy, Dec. 10, 2018). Ed. by Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini. Vol. 2253. CEUR Workshop Proceedings. Aachen. URL: http://ceur-ws.org/Vol-2253/paper23.pdf.
Stifter, David (2009). “Early Irish”. In: The Celtic Languages. Ed. by Martin Ball and Nicole Müller. Hoboken: Routledge.
Stifter, David et al. (2021). Corpus PalaeoHibernicum (CorPH) v1.0. URL: http://chronhib.maynoothuniversity.ie.
Thurneysen, Rudolf (1946). A Grammar of Old Irish. Trans. by Daniel A. Binchy and Osborn Bergin. Revised and enlarged edition. Dublin: Dublin Institute for Advanced Studies. Repr. 1993, with supplement.

PubliCatt

Automatic morphological parsing of Old Irish verbs using finite-state transducers

Author: Fransen Theodorus (ORCID:0000-0001-5639-8626)
Publication date: 01/01/2020

The topic of this paper constitutes the main part of a recently finished Ph.D. project carried out by the author, which investigates how computational methods can be employed to map cognate verb forms in Early Irish (ca. 7th–12th centuries A.D.) and Modern Irish (ca. 1200 onwards). This paper discusses the development of a finite-state morphological transducer using foma (Hulden, 2009) for the Old Irish language (ca. 7th–9th centuries A.D.), focusing on verbs. Two main challenges are discussed. First, different practices of word segmentation have significant repercussions for the encoding of dependencies both on and beyond the word level. A second challenge is complex verb stem formation and considerable stem allomorphy. This has been tackled by operating with “monolithic stem” entries for each verb lemma, i.e., synchronic, invariable hard-coded stems, representing a semi-surface-level base form.
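The “monolithic stem” idea can be sketched as a plain dictionary lookup in Python. This is only an illustration under stated assumptions: it is not a foma transducer, and the stem table, ending tables, tag names and the `analyse` function are hypothetical simplifications rather than entries from the paper's lexicon. Each stem is stored as a fixed string per lemma and tense instead of being derived by rule.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    lemma: str
    tense: str
    person: str


# Hypothetical monolithic stems: one hard-coded, invariable stem string
# per lemma and tense (illustrative, not taken from the paper).
STEMS = {
    "ber": ("berid", "present"),
    "bér": ("berid", "future"),
}

# Hypothetical, simplified ending tables, one per tense.
ENDINGS = {
    "present": {"id": "3sg", "ait": "3pl"},
    "future": {"aid": "3sg", "ait": "3pl"},
}


def analyse(form: str) -> List[Analysis]:
    """Return every stem + ending split that the toy lexicon licenses."""
    results = []
    for stem, (lemma, tense) in STEMS.items():
        if form.startswith(stem):
            ending = form[len(stem):]
            person = ENDINGS[tense].get(ending)
            if person is not None:
                results.append(Analysis(lemma, tense, person))
    return results


present_3sg = analyse("berid")
future_3pl = analyse("bérait")
```

Listing the allomorphic stems directly trades generality for simplicity: no stem-formation rules are needed, at the cost of one lexicon entry per stem variant.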

PubliCatt

Towards a computational lexical resource for the diachronic study of Irish verbs

Author: Fransen Theodorus (ORCID:0000-0001-5639-8626)
Publication venue: place:Maynooth
Publication date: 01/01/2018

In this paper, we propose a computational framework for a lexical resource that will better facilitate diachronic study of Irish verbs. The verbal system is subject to major morphological changes between Early Irish (c. 7th–12th centuries A.D.) and Modern Irish varieties (post-12th centuries) (McCone 1997). Moreover, whereas the literary output in the Old Irish period (c. 8th–9th centuries A.D.) points to a standardised language (Stifter 2009), all post-Old Irish historical varieties, except for bardic poetry (Early Modern Irish period, c. 13th–17th centuries A.D.), show a substantial degree of grammatical, orthographical and, particularly evident in the case of Early Modern Irish prose (Ó hUiginn 2013), stylistic variation (cf. contributions in McCone 1994). The available digital support is insufficient to systematically trace the linguistic change and variation. The research described here aims to mitigate the lack of digital support by creating and linking verb forms in morphologically annotated corpora. It does so by using a morphological analyser for contemporary, standardised Irish, which is already in the process of being adapted for successively earlier Modern Irish texts (Uí Dhonnchadha et al. 2014), and by developing new tagging tools for Old Irish, to project forward to later forms. This paper will focus on the creation of a morphological analyser for Old Irish using finite-state morphology (Beesley and Karttunen 2003). Recognition rates for an Early Irish sample text and associated findings and challenges will be reported on. The paper concludes with an outlook on the implementation stage of the lexical resource, its benefits and potential further research.
We will (a) discuss challenges in morphologically tagging and accurately linking verbal cognates across historical corpora, (b) explore the ways in which this resource can serve and advance (digital) scholarship in historical Irish philology and linguistics, and (c) address more general questions relating to the balance between computational methods and manual work in successfully linking cognate verb forms.

PubliCatt

Findings of the LoResMT 2021 Shared Task on COVID and Sign Language for Low-Resource Languages

Author: Fransen Theodorus (ORCID:0000-0001-5639-8626)
Publication venue: place:N/A
Publication date: 01/01/2021

We present the findings of the LoResMT 2021 shared task, which focuses on machine translation (MT) of COVID-19 data for both low-resource spoken and sign languages. The task was organized as part of the fourth workshop on technologies for machine translation of low-resource languages (LoResMT). The parallel corpora presented are publicly available and cover the following directions: English↔Irish, English↔Marathi, and Taiwanese Sign Language↔Traditional Chinese. Training data consist of 8112, 20933 and 128608 segments, respectively. There are additional monolingual data sets for Marathi and English that consist of 21901 segments. The results presented here are based on entries from a total of eight teams. Three teams submitted systems for English↔Irish while five teams submitted systems for English↔Marathi. Unfortunately, there were no system submissions for the Taiwanese Sign Language↔Traditional Chinese task. System performance was measured using BLEU; the maximum scores were 36.0 for English–Irish, 34.6 for Irish–English, 24.2 for English–Marathi, and 31.3 for Marathi–English.

PubliCatt

How Computers Can Future-Proof Minority Languages

Author: Fransen Theodorus (ORCID:0000-0001-5639-8626)
Publication date: 01/01/2021

Dr. Theodorus Fransen & Dr. John McCrae explore how digital language tools can potentially resolve the underrepresentation of minority languages in digital technology and on the Web.

PubliCatt