33 research outputs found
Automatic alignment of the Psalterium Sinaiticum and the Septuagint Psalms
This paper describes the work on automatically aligning the Psalterium Sinaiticum with the Septuagint psalms in the Tromsø Old Russian and OCS Treebank (TOROT). It briefly accounts for the transcription, text processing and manual annotation of the Psalterium Sinaiticum itself. It then explains the choice of Greek text, describes the automatic lemmatisation and morphological tagging of the Greek text, and calculates and analyses the success rate in a small sample. Next, the algorithm for automatic token-level alignment of texts is briefly described, and its success rate is calculated and analysed. The results seem quite good from a quantitative perspective (over 90% accuracy in most cases), and it may seem tempting to try to use the data directly. However, a pilot study of aspect in the Greek and OCS text shows that the automatically processed Greek parallel leads to considerable data loss, and that much manual sifting of apparent mismatch examples is necessary to arrive at a preliminary analysis. In a low-resourced historical language such as Old Church Slavonic we cannot afford to work with this amount of noise and data loss. We can use automatic tagging and alignment to ease our workload, but we have to manually post-correct the output.
OldSlavNet: A scalable Early Slavic dependency parser trained on modern language data
Historical languages are increasingly being modelled computationally. Syntactically annotated texts are often a sine qua non in their modelling, but parsing of pre-modern language varieties faces great data sparsity, intensified by high levels of orthographic variation. In this paper we present a good-quality Early Slavic dependency parser, attained via manipulation of modern Slavic data to resemble the orthography and morphosyntax of pre-modern varieties. The tool can be deployed to expand historical treebanks, which are crucial for data collection and quantification, and beneficial to downstream NLP tasks and historical text mining.
Internship report in community pharmacy
Internship report carried out as part of the Mestrado Integrado em Ciências Farmacêuticas (Integrated Master's in Pharmaceutical Sciences), submitted to the Faculdade de Farmácia da Universidade de Coimbr
Verbal constructional profiles: reliability, distinction power and practical applications
In this paper we explore the notion of constructional profiles (the frequency distribution of a given linguistic item across syntactic environments) from two angles, methodological and applied. We concentrate on verbal constructional profiles, using Russian argument frame data in two different dependency formats. We first test the profiles' stability and distinction power across sample sizes, and then use the profiles in two tasks concerning Russian aspect: to identify the aspectual partner of a given verb and to guess whether a given verb is perfective or imperfective.
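As a rough illustration of the definition above, a constructional profile can be computed as the relative frequency of each syntactic environment in which an item occurs. The sketch below is not the authors' implementation: the frame labels, the toy data and the choice of total variation distance for comparing two profiles are all assumptions made for the example.

```python
from collections import Counter

def constructional_profile(frames):
    """Return the relative frequency of each syntactic frame in `frames`."""
    counts = Counter(frames)
    total = sum(counts.values())
    return {frame: n / total for frame, n in counts.items()}

def profile_distance(p, q):
    """Total variation distance between two profiles (0 = identical, 1 = disjoint).

    The distance measure is an assumption for this sketch; the abstract
    does not say how profiles are compared.
    """
    frames = set(p) | set(q)
    return 0.5 * sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in frames)

# Hypothetical argument frames observed for two verbs (toy data).
verb_a_frames = ["obj", "obj", "obj+obl", "none"]
verb_b_frames = ["obj", "obj", "obj", "obj+obl"]

profile_a = constructional_profile(verb_a_frames)
profile_b = constructional_profile(verb_b_frames)
distance = profile_distance(profile_a, profile_b)
```

Under these assumptions, a candidate aspectual partner could be chosen as the verb whose profile lies closest to that of the target verb.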
Corpus (based) research and Slavic
Slavic linguistics has a long history with corpus linguistics in the widest sense of the term. Large collections of authentic texts are a prerequisite for historical linguistics, which is where Slavic linguistics started, and more recently this strand of research has yielded large historical corpora. Slavic l
Diachronic Treebanks for Historical Linguistics
Over the last few decades, the widespread diffusion of digital technology has increased availability of primary textual sources, radically changing the everyday life of scholars in the humanities, who are now able to access, query and process a wealth of empirical evidence in ways not possible before.
For ancient languages, too, corpora enhanced with increasingly complex layers of metalinguistic information, such as part-of-speech tagging and syntactic annotation (the latter called 'treebanks'), are now available. In particular, diachronic treebanks, which provide data for a language across several of its historical stages, allow for a new approach to diachronic studies of syntactic phenomena where scholars previously had to content themselves with empirical work on a much smaller scale.
This volume brings together a set of papers that report research on various diachronic matters supported by evidence from diachronic treebanks. The contents of the papers cover a wide range of languages, including English, French, Russian, Old Church Slavonic, Latin and Ancient Greek.
Linguistics vs. digital editions: The Tromsø Old Russian and OCS Treebank
The Tromsø Old Russian and OCS Treebank (TOROT, nestor.uit.no) is, along with its parent treebank, the PROIEL corpus (foni.uio.no), the only existing treebank of Old Church Slavonic (OCS), Old East Slavic and Middle Russian texts. There are other tagged resources, such as the Old Russian subcorpus of the Russian National Corpus and the Manuskript corpus, but none of them, to our knowledge, currently provide syntactic annotation.
The TOROT presently contains approximately 160,000 word tokens of fully annotated OCS (Codex Marianus and Codex Suprasliensis), 85,000 word tokens of fully annotated Kiev-era Old East Slavic, and 60,000 word tokens of fully annotated 15th–17th-century Middle Russian. In addition, it contains the Codex Zographensis with automatic and partially hand-corrected morphological annotation and lemmatisation (sections of the Gospels missing in the Codex Marianus also have full syntactic annotation), and the PROIEL version of the Greek Gospels, with which the Codex Marianus and the Codex Zographensis are both aligned at token level (automatically, then hand-corrected).
Automatic Identification of Shared Arguments in Verbal Coordinations
We describe automatic conversion of the SynTagRus dependency treebank of Russian to the PROIEL format (with the ultimate purpose of obtaining a single-format diachronic treebank spanning more than a thousand years), focusing on analysis of shared arguments in verbal coordinations. Whether arguments are shared or private is not marked in the SynTagRus native format, but the PROIEL format indicates sharing by means of secondary dependencies. In order to recover missing information and insert secondary dependencies into the converted SynTagRus, we create a simple guessing algorithm based on four probabilistic features: how likely a given argument type is to be shared; how likely an argument in a given position is to be shared; how likely a given verb is to have a given argument; how likely a given verb is to have a given argument frame. Boosted with a few deterministic rules and trained on a small manually annotated sample (346 sentences), the guesser very successfully inserts shared subjects (F-score 0.97), which results in excellent overall performance (F-score 0.92). Non-subject arguments are shared much more rarely, and for them the results are poorer (0.31 for objects; 0.22 for obliques). We show, however, that there are strong reasons to believe that performance can be increased if a larger training sample is used and the guesser gets to see enough positive examples. Apart from describing a useful practical solution, the paper also provides quantitative data about and offers non-trivial insights into Russian verbal coordination.
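To make the idea of probabilistic features concrete, the sketch below estimates two of the four features named above (how likely an argument type is to be shared, and how likely an argument in a given position is to be shared) as relative frequencies in an annotated sample. It is a toy illustration only: the record layout, the sample data, the averaging rule and the 0.5 threshold are all assumptions, and the abstract does not describe how the authors actually combine their features.

```python
from collections import defaultdict

def estimate_share_rates(samples, key):
    """Relative frequency of sharing for each value of `key` in annotated samples."""
    shared = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s[key]] += 1
        shared[s[key]] += int(s["shared"])
    return {k: shared[k] / total[k] for k in total}

def guess_shared(arg, type_rates, position_rates, threshold=0.5):
    """Guess sharing by averaging two estimated rates.

    The averaging rule and the threshold are assumptions for this sketch,
    not the combination used in the paper.
    """
    score = (type_rates.get(arg["type"], 0.0)
             + position_rates.get(arg["position"], 0.0)) / 2
    return score >= threshold

# Hypothetical manually annotated arguments (toy data, not from SynTagRus).
training = [
    {"type": "subj", "position": "before", "shared": True},
    {"type": "subj", "position": "before", "shared": True},
    {"type": "subj", "position": "after", "shared": False},
    {"type": "obj", "position": "after", "shared": False},
    {"type": "obj", "position": "before", "shared": True},
    {"type": "obj", "position": "after", "shared": False},
]

type_rates = estimate_share_rates(training, "type")
position_rates = estimate_share_rates(training, "position")

subject_guess = guess_shared({"type": "subj", "position": "before"},
                             type_rates, position_rates)
object_guess = guess_shared({"type": "obj", "position": "after"},
                            type_rates, position_rates)
```

With so few positive examples for objects in the toy sample, the estimated rates stay low, which mirrors the abstract's point that rarely shared argument types need a larger training sample.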