36 research outputs found

    Neural morphosyntactic tagging for Rusyn

    The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 as compared to previous research. Peer reviewed

    Handwritten Text Recognition for Croatian Glagolitic

    The paper presents and discusses recent advances in Handwritten Text Recognition (HTR) technologies for handwritten and early printed texts in Croatian Glagolitic script. After elaborating on the general principles of training HTR models with respect to the Transkribus platform used for these experiments, the characteristics of the trained models are discussed. Specifically, the models use the Latin script to transcribe the Glagolitic source. In doing so, they transcribe ligatures and resolve abbreviations correctly in the majority of cases. The computed error rate of the models is below 6%, and real-world performance seems to be similar. Using the models for pre-transcription can save a great amount of time when editing manuscripts and, thanks to fuzzy search (keyword spotting), even uncorrected HTR transcriptions can be used for various kinds of analysis. The models are publicly available via the Transkribus platform. Every scholar working on Glagolitic manuscripts and early printings is encouraged to use them.
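The abstract above reports an error rate below 6% but does not say how it is computed. A common measure for HTR output is the character error rate (CER): the edit distance between the model transcription and a reference transcription, divided by the length of the reference. The following is a minimal sketch under that assumption; the function names are illustrative and are not taken from the paper or from Transkribus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, a hypothesis that gets one character wrong in a four-character reference yields a CER of 0.25; an error rate below 6% would correspond to fewer than 6 edits per 100 reference characters.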

    New Developments in Tagging Pre-modern Orthodox Slavic Texts

    Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we further explore the normalization path; at the same time, we use the statistical CRF-tagger MarMoT and a newly developed neural network tagger that cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer experiments could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community. Peer reviewed

    Elternbefragung zum gymnasialen Russischunterricht mit Herkunftssprachenlernenden: Erwartungen zu Motivation, sprachlichen Kompetenzen und kultureller Identität

    This contribution presents a study on the expectations of parents whose children have a Russian-speaking background and attend Russian classes at Gymnasium level. The survey asked about the parents' motivation and about their expectations regarding the development of the learners' language competences and the role of the teacher. It also determined what support parents have drawn on so far to maintain the Russian heritage language and culture for their children. 3. Tagung des Arbeitskreises Didaktik der Slawischen Sprachen, Berlin, 19.-20.02.202

    TerraSAR-X Time Series Fill a Gap in Spaceborne Snowmelt Monitoring of Small Arctic Catchments—A Case Study on Qikiqtaruk (Herschel Island), Canada

    The timing of snowmelt is an important turning point in the seasonal cycle of small Arctic catchments. The TerraSAR-X (TSX) satellite mission is a synthetic aperture radar (SAR) system with high potential to measure the high spatiotemporal variability of snow cover extent (SCE) and fractional snow cover (FSC) on the small catchment scale. We investigate the performance of multi-polarized and multi-pass TSX X-Band SAR data in monitoring SCE and FSC in small Arctic tundra catchments of Qikiqtaruk (Herschel Island) off the Yukon Coast in the Western Canadian Arctic. We applied a threshold-based segmentation on ratio images between TSX images with wet snow and a dry snow reference, and tested the performance of two different thresholds. We quantitatively compared TSX- and Landsat 8-derived SCE maps using confusion matrices and analyzed the spatiotemporal dynamics of snowmelt from 2015 to 2017 using TSX, Landsat 8 and in situ time-lapse data. Our data showed that the quality of SCE maps from TSX X-Band data is strongly influenced by polarization and to a lesser degree by incidence angle. VH-polarized TSX data performed best in deriving SCE when compared to Landsat 8. TSX-derived SCE maps from VH polarization detected late-lying snow patches that were not detected by Landsat 8. Results of a local assessment of TSX FSC against the in situ data showed that TSX FSC accurately captured the temporal dynamics of different snowmelt regimes that were related to topographic characteristics of the studied catchments. Both in situ and TSX FSC showed a longer snowmelt period in a catchment with higher contributions of steep valleys and a shorter snowmelt period in a catchment with higher contributions of upland terrain. Landsat 8 had fundamental data gaps during the snowmelt period in all 3 years due to cloud cover. The results also revealed that by choosing a positive threshold of 1 dB, detection of ice layers due to diurnal temperature variations resulted in a more accurate estimation of snow cover than a negative threshold that detects wet snow alone. We find that TSX X-Band data in VH polarization performs at a comparable quality to Landsat 8 in deriving SCE maps when a positive threshold is used. We conclude that TSX data polarization can be used to accurately monitor snowmelt events at high temporal and spatial resolution, overcoming limitations of Landsat 8, which due to cloud-related data gaps generally only indicated the onset and end of snowmelt.
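The threshold-based segmentation and the confusion-matrix comparison described in this abstract can be illustrated with a small sketch. The abstract gives neither the exact decision rule nor the preprocessing, so the code below rests on assumptions: backscatter is given in linear intensity units, the ratio is converted to decibels, and a pixel is labeled as snow when its ratio falls below the chosen threshold. All function names are hypothetical.

```python
import math


def ratio_db(image, reference):
    """Per-pixel ratio of an image to the dry-snow reference, in decibels."""
    return [[10.0 * math.log10(p / r) for p, r in zip(row_i, row_r)]
            for row_i, row_r in zip(image, reference)]


def snow_mask(ratio, threshold_db):
    """Assumed rule: a pixel counts as snow if its ratio is below the threshold."""
    return [[value < threshold_db for value in row] for row in ratio]


def confusion_matrix(predicted, truth):
    """Count true/false positives and negatives of a mask against a reference mask."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for row_p, row_t in zip(predicted, truth):
        for p, t in zip(row_p, row_t):
            key = ("t" if p == t else "f") + ("p" if p else "n")
            counts[key] += 1
    return counts


# Toy 2x2 scene with made-up values, not data from the study.
scene = [[0.05, 0.20], [0.10, 0.15]]
dry_reference = [[0.10, 0.10], [0.10, 0.10]]
reference_mask = [[True, False], [True, True]]  # stand-in for an optical SCE map

mask = snow_mask(ratio_db(scene, dry_reference), threshold_db=1.0)
print(confusion_matrix(mask, reference_mask))
```

Under this assumed rule, moving the threshold from a negative value to +1 dB enlarges the set of pixels labeled as snow, which is one way to read the reported gain in SCE accuracy; the study's actual rule may differ.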

    Exploring data provenance in handwritten text recognition infrastructure: Sharing and reusing ground truth data, referencing models, and acknowledging contributions. Starting the conversation on how we could get it done

    This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.