
    Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti

    Machine learning techniques are presented for the automatic recognition of historical letters (XI-XVIII centuries) carved on the stone walls of St. Sophia Cathedral in Kyiv (Ukraine). A new image dataset of these carved Glagolitic and Cyrillic letters (CGCL) was assembled and pre-processed for recognition and prediction by machine learning methods. The dataset consists of more than 4000 images for 34 types of letters. Exploratory data analysis of the CGCL and notMNIST datasets showed that the carved letters can hardly be differentiated by dimensionality reduction methods such as t-distributed stochastic neighbor embedding (tSNE), because stone carving represents letters less clearly than handwriting does. Multinomial logistic regression (MLR) and 2D convolutional neural network (CNN) models were applied. The MLR model yielded area under the curve (AUC) values for the receiver operating characteristic (ROC) of no lower than 0.92 for notMNIST and 0.60 for CGCL. The CNN model gave AUC values close to 0.99 for both notMNIST and CGCL (despite the much smaller size and lower quality of CGCL in comparison to notMNIST) under the condition of highly lossy data augmentation. The CGCL dataset was published as an open source resource for the data science community.
    Comment: 11 pages, 9 figures, accepted for the 25th International Conference on Neural Information Processing (ICONIP 2018), 14-16 December 2018 (Siem Reap, Cambodia).
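The abstract reports AUC values as lower bounds ("not lower than"), which suggests one ROC AUC per letter class. The following is a minimal sketch of that evaluation, assuming a one-vs-rest setup and using the rank interpretation of AUC (the probability that a randomly chosen positive example outscores a randomly chosen negative one). The function names and the toy probabilities are hypothetical and are not taken from the paper.

```python
def roc_auc(scores, labels):
    """ROC AUC as P(positive score > negative score), counting ties as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, true_labels, classes):
    """One AUC per class: that class is 'positive', every other class 'negative'."""
    result = {}
    for k, cls in enumerate(classes):
        scores = [row[k] for row in prob_rows]       # predicted probability of cls
        labels = [t == cls for t in true_labels]     # True where cls is the truth
        result[cls] = roc_auc(scores, labels)
    return result


if __name__ == "__main__":
    # Hypothetical class probabilities for four images and three letter classes.
    classes = ["A", "B", "C"]
    probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.3, 0.5], [0.4, 0.5, 0.1]]
    truth = ["A", "B", "C", "A"]
    per_class = one_vs_rest_auc(probs, truth, classes)
    print(per_class, "minimum AUC:", min(per_class.values()))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a library routine would normally be used on a dataset of several thousand images.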

    Remarks on the Letter of the Patriarch Theophylact to Tsar Peter in the Context of Certain Byzantine and Slavic Anti-heretic Texts

    Translated by Marek Majer. The Letter of patriarch Theophylact to tsar Peter is the oldest, but seemingly not the most informative, Greek source for the history of Bogomilism. It is in essence a standard document, a typical product of the patriarch's chancery; it is conceived not as an in-depth investigation into the theological minutiae of the cosmogony, dogmas and social doctrines of the heretics and the orthodox Church, but rather as a practical tutorial on how to thwart any given neo-Manichaean dualist heresy. It brings to light the fact that Bogomilism, the 'new' heresy, was treated as an 'old' one: a 'reactivation' of earlier gnostic-dualist and neo-Manichaean movements. The letter also contains a peculiar innovation, though not one directly related to the Bogomil heresy itself: the degree of commitment to preaching the dogmas of the heresy is used to differentiate the situation of its followers. The analysis of the Letter of patriarch Theophylact to tsar Peter raises the more general issue of the detailed study of Byzantine and Slavic liturgical texts as a source of information on neo-Manichaean doctrines.

    Recognizing Degraded Handwritten Characters

    In this paper, Slavonic manuscripts from the 11th century written in Glagolitic script are investigated. State-of-the-art optical character recognition methods produce poor results for degraded handwritten document images, largely because basic pre-processing steps such as binarization and image segmentation fail to deliver suitable results. Therefore, a new, binarization-free approach is presented that is independent of pre-processing deficiencies. It additionally incorporates local information in order to recognize fragmented or faded characters as well. The proposed algorithm consists of two steps: character classification and character localization. First, scale-invariant feature transform (SIFT) features are extracted and classified using support vector machines. On this basis, interest points are clustered according to their spatial information. Then, characters are localized and finally recognized by a weighted voting scheme over the pre-classified local descriptors. Preliminary results show that the proposed system can handle highly degraded manuscript images with background noise, e.g. stains, tears, and faded characters.
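The final step described above, a weighted vote over pre-classified local descriptors, can be illustrated with a short sketch. It assumes each descriptor in one spatial cluster already carries a predicted character label and a confidence weight; the abstract does not specify how the weights are computed, so the tuple layout, the function name and the example values are hypothetical.

```python
from collections import defaultdict


def weighted_vote(descriptors):
    """Return the label whose descriptors carry the largest total weight.

    descriptors: iterable of (label, weight) pairs for one spatial cluster,
    where label is the class predicted for a local descriptor and weight is
    an assumed confidence value (e.g. a classifier score).
    """
    totals = defaultdict(float)
    for label, weight in descriptors:
        totals[label] += weight
    if not totals:
        raise ValueError("cannot vote on an empty cluster")
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical cluster: four interest points voting for two candidate letters.
    cluster = [("a", 0.9), ("b", 0.5), ("b", 0.3), ("a", 0.1)]
    print(weighted_vote(cluster))
```

Summing weights rather than counting votes lets a few confident descriptors outweigh many uncertain ones, which matters when parts of a character are faded or missing.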

    Digitization of Cyrillic Manuscripts for the Historical Dictionary of the Serbian Language Using Handwritten Text Recognition Technology

    The paper explores the possibilities of using information technologies based on the principles of machine learning and artificial intelligence in the process of digitizing Cyrillic manuscripts for the purpose of creating a historical dictionary of the Serbian language. The empirical research is based on the use of the Transkribus software platform to create a model for automatic text recognition of the manuscripts of Gavril Stefanović Venclović, the most significant and prolific Serbian cultural enthusiast of the 18th century, whose extensive manuscript legacy in the Serbian vernacular represents the most significant primary source for the historical dictionary of the Serbian language of this period. The results indicate that the digitization of Cyrillic manuscripts for this purpose can be significantly accelerated with Transkribus by creating specific and generic models for automatic text recognition. The advantage of automatic text recognition over traditional methods lies in particular in the possibility of continuously improving the performance of specific and generic models as the transcription process advances and the amount of digitized text available for training a new version of the model grows. DOI: 10.31168/2305-6754.2023.1.08

    Ethics and politics of Great Moravia of the 9th century

    The author studies the role of Christianity in two forms of 9th-century political ethics in the history of Great Moravia, represented by the Great Moravian rulers Rastislav and Svatopluk. Rastislav's conception predominantly uses the pre-Erasmian model of political ethics based on the pursuit of welfare for the country and its inhabitants by achieving the clerical-political independence of Great Moravia from the Frankish kingdom and, moreover, by utilising Christianity for the advancement of culture, education, literature, law and legality, as well as by spreading Christian ethics and morality in the form of the Christian code of ethics expressed in ethical-legal documents. Svatopluk's political conception was a prototype of Machiavellian political ethics, according to which one is, in the interest of the country and its power and fame, allowed to be a lion and/or a fox. Svatopluk abused Christianity in the name of achieving his power-oriented goals. Great Moravia outlived Rastislav; it did not, however, outlive Svatopluk, as shortly after his death it broke up and ceased to exist. The author concludes that Rastislav's conception was more viable, as its cultural heritage lives on in the form of works by Constantine and Methodius.

    Reading Polish with Czech Eyes: Distance and Surprisal in Quantitative, Qualitative, and Error Analyses of Intelligibility

    In CHAPTER I, I first introduce the thesis in the context of the project workflow in section 1. I then summarise the methods and findings from the project publications about the languages in focus. There I also introduce the relevant concepts and terminology viewed in the literature as possible predictors of intercomprehension and processing difficulty. CHAPTER II presents a quantitative (section 4) and a qualitative (section 5) analysis of the results of the cooperative translation experiments. The focus of this thesis, the language pair PL-CS, is explained and the hypotheses are introduced in section 6. The experiment website is introduced in section 7 with an overview of the participants, the experiments conducted, and the sections in which they are discussed. In CHAPTER IV, free translation experiments are discussed in which two different sets of individual word stimuli were presented to Czech readers: (i) cognates that are transformable with regular PL-CS correspondences (section 12) and (ii) the 100 most frequent PL nouns (section 13). CHAPTER V presents the findings of experiments in which PL NPs in two different linearisation conditions were presented to Czech readers (sections 14.1-14.6). A short digression follows on experiments with PL internationalisms that were presented to German readers (section 14.7). CHAPTER VI discusses the methods and results of cloze translation experiments with highly predictable target words in sentential context (section 15) and in random context with sentences from the cooperative translation experiments (section 16). A final synthesis of the findings, together with an outlook, is provided in CHAPTER VII.

    Artificial Sequences and Complexity Measures

    In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which make crucial use of data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of the dictionary of a given sequence and of Artificial Text, and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method, which applies to any kind of corpus of character strings independently of the type of coding behind them. As a case study we consider linguistically motivated problems, and we present results for automatic language recognition, authorship attribution and self-consistent classification.
    Comment: Revised version, with major changes, of previous "Data Compression approach to Information Extraction and Classification" by A. Baronchelli and V. Loreto. 15 pages; 5 figures
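To make the idea of a compression-based measure of remoteness concrete, here is a minimal sketch. It assumes one simple construction: the number of extra compressed bytes needed when a short probe text is appended to a reference text, with the probe attributed to the reference for which that cost is smallest. This construction, the choice of `zlib` as the compressor, the function names and the sample texts are assumptions made for illustration; they are not necessarily the measure the authors define.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of data (maximum level)."""
    return len(zlib.compress(data, 9))


def extra_cost(reference: str, probe: str) -> int:
    """Extra compressed bytes needed to encode probe once reference is known.

    A small value means the compressor can describe the probe largely with
    patterns it has already seen in the reference.
    """
    ref = reference.encode("utf-8")
    return compressed_size(ref + probe.encode("utf-8")) - compressed_size(ref)


def closest_reference(references: dict, probe: str) -> str:
    """Name of the reference text for which the probe is cheapest to append."""
    return min(references, key=lambda name: extra_cost(references[name], probe))


if __name__ == "__main__":
    # Hypothetical, very short reference texts; real use needs far longer ones.
    references = {
        "en": "the quick brown fox jumps over the lazy dog while the cat sleeps "
              "by the warm fire and the rain falls softly on the old stone roof",
        "it": "nel mezzo del cammin di nostra vita mi ritrovai per una selva "
              "oscura che la diritta via era smarrita e quanto a dir qual era",
    }
    probe = "the cat sleeps by the warm fire"
    for name, text in references.items():
        print(name, extra_cost(text, probe))
    print("closest:", closest_reference(references, probe))
```

Because the same header and checksum overhead appears in both compressed sizes, it cancels in the difference. Note that `zlib` only looks back over a 32 KiB window, so with long references only the most recent part of the reference influences the cost.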

    The National Literary Canon in Translation

    The article examines problems of literary translation, the reception of the translated text, the transformation of the national literary canon, and the fate of a literary work in the translating culture.