472 research outputs found

    Towards LLOD-based language contact studies: a case study in interoperability

    We describe a methodological and technical framework for conducting qualitative and quantitative studies of linguistic research questions over diverse and heterogeneous data sources such as corpora and elicitations. We demonstrate how LLOD formalisms can be employed to develop extraction pipelines for features and linguistic examples from corpora and collections of interlinear glossed text, and furthermore, how SPARQL UPDATE can be employed (1) to normalize diverse data against a reference data model (here, POWLA), (2) to harmonize annotation vocabularies by reference to terminology repositories (here, OLiA), (3) to extract examples from these normalized data structures regardless of their origin, and (4) to implement this extraction routine in a tool-independent manner for different languages with different annotation schemes. We demonstrate our approach for language contact studies for genetically unrelated, but neighboring languages from the Caucasus area, Eastern Armenian and Georgian

    Finding the Origin of a Translated Historical Document

    Gospels are one type of translated historical document. There are many versions of the same Gospel that have been translated from the original, or from another Gospel that has already been translated into a different language. Nowadays, it is difficult to determine the language of the original Gospel from which these Gospels were translated. In this paper we use a supervised machine learning technique to determine the origin of a version of the Georgian Gospel

    Computational Analysis of Morphosyntactic Categories in Georgian

    This thesis describes the development of part-of-speech tagging resources for the Georgian language, consisting of (i) a new morphosyntactic language model for part-of-speech (POS) tagging purposes; (ii) tagging guidelines for tagging and post-editing; (iii) the KATAG tagset; and (iv) the trained parameter files the probabilistic TreeTagger program needs to work on Georgian texts. A new morphosyntactic model of Georgian for part-of-speech tagging purposes is described in the thesis. The thesis also describes a tagset (KATAG) defined in accordance with the new morphosyntactic model of the language and a set of design principles and tagging guidelines. A stochastic methodology is used here to perform tagging in Georgian. Namely, TreeTagger, a probabilistic part-of-speech tagging program, has been trained on Georgian texts. The justification for this choice is discussed. I use two tokenisation approaches in part-of-speech tagging. An accuracy of 92.41% was achieved using an enclitic tokenisation approach and an accuracy of 87.13% using a non-enclitic tokenisation approach, corroborating my hypothesis that treating enclitic elements separately from the host words results in better tagging performance. To make the tagger program easily adaptable for a range of inputs (type, variety or genre of text), the performance of the probabilistic TreeTagger program was evaluated on the obtained test set, which covers five different genres: academic, informal, legal, fiction and news

    Ars Edendi Lecture Series

    This is the fifth and final volume of lectures on textual criticism and classical philology - broadly understood - given within the framework of the Ars edendi research programme (2008-2015). Two of the six papers in this volume stem from a 2015 workshop on editorial theory and method, the theme of which dealt with fragments and the writing of commentaries. As regards the former, S. Douglas Olson problematizes the creation and continuation of scholarly knowledge concerning texts that have only come down to us in a fragmentary state, emphasizing the challenges and pitfalls that lie in wait for the editor. Benjamin Millis offers a nuanced homage and apology for the traditional text edition with a scholarly commentary, especially underscoring its importance as a connective pathway between text and reader as well as the impetus it can give to scholarly research. The other four lectures were given at the concluding conference of the Ars edendi programme, held in August 2016. In a case study Cynthia Damon shares her reflections on how to digitally edit Pliny’s Natural History in a form that will provide this work’s rich reception history and at the same time its extensive use of sources, many of which are now lost. The digital component is also prominent in Odd Einar Haugen’s contribution in which he shows that digital mark-up is also an editorial enterprise and how it can be useful for the textual scholar. Dorothea Weber gives an insider’s view of the Corpus Scriptorum Ecclesiasticorum Latinorum, an editorial project on-going since 1864, and especially how improved cataloguing has led to numerous discoveries of texts by St. Augustine. As a conclusion to the volume, David Greetham, one of the founders of the Society for Textual Scholarship, reflects on three different methods for editing texts that have undergone various degrees of rescription, namely the oeuvres of Eriugena, Coleridge, and Eliot

    UniMorph 4.0: Universal Morphology

    The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet

    Bible as Notepad. Tracing Annotations and Annotation Practices in Late Antique and Medieval Biblical Manuscripts

    The present volume provides a comparative look at the contents and layout features of secondary annotations in biblical manuscripts across linguistic traditions. Due to the privileged focus on the text in the columns, these annotations and the practices that produced them have not received the scholarly attention they deserve. The vast richness of extant verbal and figurative notes accompanying the biblical texts in the intercolumns and margins of the manuscript pages has thus been largely overlooked. The case studies gathered in this volume explore Jewish and Christian biblical manuscripts through the lens of their annotations, addressing the various relationships between the primary layer of text and the secondary notes, and exploring the roles and functions of annotated manuscripts as cultural artifacts. By approaching biblical manuscripts as potential notepads, the volume offers theoretical reflection and empirical analyses of the ways in which secondary notes may shed new light on the development and transmission of text traditions, the shifting engagement with biblical manuscripts over time, as well as the change of use and interpretation that may result from the addition of the notes themselves
