Proposal to encode nine Cyrillic characters for Slavonic
This is a proposal to add nine Cyrillic characters to the international character encoding standard Unicode. These additions were published in Unicode Standard version 6.1 in January 2012. The proposal covers characters that occur in medieval Church Slavonic manuscripts from the 10th/11th to the 17th century CE.
Neural morphosyntactic tagging for Rusyn
The paper presents experiments on part-of-speech and full morphological tagging of the Slavic minority language Rusyn. The proposed approach relies on transfer learning and uses only annotated resources from related Slavic languages, namely Russian, Ukrainian, Slovak, Polish, and Czech. It does not require any annotated Rusyn training data, nor parallel data or bilingual dictionaries involving Rusyn. Compared to earlier work, we improve tagging performance by using a neural network tagger and larger training data from the neighboring Slavic languages. We experiment with various data preprocessing and sampling strategies and evaluate the impact of multitask learning strategies and of pretrained word embeddings. Overall, while genre discrepancies between training and test data have a negative impact, we improve full morphological tagging by 9% absolute micro-averaged F1 as compared to previous research. Peer reviewed.
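The improvement above is reported as micro-averaged F1. As a rough illustration of how this metric aggregates per-token tagging decisions (a minimal sketch with invented example tags, not the authors' evaluation code), note that for single-label tagging the micro-averaged precision, recall, and F1 all reduce to token accuracy:

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over full morphological tags.

    TP/FP/FN are counted explicitly to show the general computation,
    even though for single-label tagging every wrong prediction is
    simultaneously one false positive and one false negative.
    """
    tp = sum(g == p for g, p in zip(gold, pred))
    fp = len(pred) - tp  # wrong predictions, counted against the predicted tag
    fn = len(gold) - tp  # the same errors, counted against the gold tag
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical four-token sentence with one tagging error:
gold = ["NOUN;Nom;Sg", "VERB;Pres;3Sg", "ADJ;Nom;Sg", "NOUN;Acc;Pl"]
pred = ["NOUN;Nom;Sg", "VERB;Pres;3Sg", "NOUN;Nom;Sg", "NOUN;Acc;Pl"]
print(micro_f1(gold, pred))  # 0.75
```

When tags are multi-label (e.g. scored feature by feature), micro-averaging over individual features no longer equals accuracy, which is why papers state the averaging mode explicitly.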
Handwritten Text Recognition for Croatian Glagolitic
The paper presents and discusses recent advances in Handwritten Text Recognition (HTR) technologies for handwritten and early printed texts in Croatian Glagolitic script. After elaborating on the general principles of training HTR models with respect to the Transkribus platform used for these experiments, the characteristics of the trained models are discussed. Specifically, the models use the Latin script to transcribe the Glagolitic source. In doing so, they transcribe ligatures and resolve abbreviations correctly in the majority of cases. The computed error rate of the models is below 6%, and real-world performance appears similar, comparable to the usual error rate when transcription is carried out by experts. Using the models for pre-transcription can save a great amount of time when editing manuscripts, and, thanks to fuzzy search (keyword spotting), even uncorrected HTR transcriptions can be used for various kinds of analysis. The models are publicly available via the Transkribus platform. Every scholar working on Glagolitic manuscripts and early printings is encouraged to use them.
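The error rate quoted above is conventionally a character error rate (CER): the edit distance between the HTR output and a reference transcription, divided by the reference length. A minimal sketch of that computation (invented example strings; Transkribus computes this internally):

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein edit distance between the
    reference transcription and the HTR output, divided by the
    reference length. Uses the classic dynamic-programming table,
    keeping only the previous row for O(n) memory."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + cost)  # substitution / match
        prev = curr
    return prev[n] / m

# One substitution in a ten-character reference -> CER of 10%.
print(cer("glagolitic", "glagoljtic"))  # 0.1
```

A model "below 6%" in these terms makes fewer than six character-level errors per hundred characters of reference text.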
New Developments in Tagging Pre-modern Orthodox Slavic Texts
Pre-modern Orthodox Slavic texts pose certain difficulties when it comes to part-of-speech and full morphological tagging. Orthographic and morphological heterogeneity makes it hard to apply resources that rely on normalized data, which is why previous attempts to train part-of-speech (POS) taggers for pre-modern Slavic often apply normalization routines. In the current paper, we do not pursue the normalization path; instead, we use the statistical CRF tagger MarMoT and a newly developed neural network tagger, both of which cope better with variation than previously applied rule-based or statistical taggers. Furthermore, we conduct transfer experiments to apply Modern Russian resources to pre-modern data. Our experiments show that while transfer could not improve tagging performance significantly, state-of-the-art taggers reach between 90% and more than 95% tagging accuracy and thus approach the tagging accuracy of modern standard languages with rich morphology. Remarkably, these results are achieved without the need for normalization, which makes our research of practical relevance to the Paleoslavistic community. Peer reviewed.
Parent survey on Russian instruction with heritage-language learners at grammar schools: Expectations regarding motivation, language competencies, and cultural identity
The study presented here examines the expectations of parents whose children have a Russian-speaking background and attend Russian classes at grammar school (Gymnasium). It surveys the parents' motivation as well as their expectations regarding the development of the learners' language competencies and the role of the teacher. In addition, it records what support parents have so far drawn on to maintain their children's Russian heritage language and culture. 3rd Meeting of the Arbeitskreis Didaktik der Slawischen Sprachen, Berlin, 19.-20.02.202
TerraSAR-X Time Series Fill a Gap in Spaceborne Snowmelt Monitoring of Small Arctic Catchments—A Case Study on Qikiqtaruk (Herschel Island), Canada
The timing of snowmelt is an important turning point in the seasonal cycle of small Arctic catchments. The TerraSAR-X (TSX) satellite mission is a synthetic aperture radar (SAR) system with high potential to measure the high spatiotemporal variability of snow cover extent (SCE) and fractional snow cover (FSC) on the small catchment scale. We investigate the performance of multi-polarized and multi-pass TSX X-band SAR data in monitoring SCE and FSC in small Arctic tundra catchments of Qikiqtaruk (Herschel Island) off the Yukon Coast in the western Canadian Arctic. We applied a threshold-based segmentation on ratio images between TSX images with wet snow and a dry-snow reference, and tested the performance of two different thresholds. We quantitatively compared TSX- and Landsat 8-derived SCE maps using confusion matrices and analyzed the spatiotemporal dynamics of snowmelt from 2015 to 2017 using TSX, Landsat 8, and in situ time-lapse data. Our data showed that the quality of SCE maps from TSX X-band data is strongly influenced by polarization and, to a lesser degree, by incidence angle. VH-polarized TSX data performed best in deriving SCE when compared to Landsat 8. TSX-derived SCE maps from VH polarization detected late-lying snow patches that were not detected by Landsat 8. A local assessment of TSX FSC against the in situ data showed that TSX FSC accurately captured the temporal dynamics of different snowmelt regimes related to the topographic characteristics of the studied catchments. Both in situ and TSX FSC showed a longer snowmelt period in a catchment with higher contributions of steep valleys and a shorter snowmelt period in a catchment with higher contributions of upland terrain. Landsat 8 had fundamental data gaps during the snowmelt period in all three years due to cloud cover. The results also revealed that, with a positive threshold of 1 dB, detection of ice layers formed by diurnal temperature variations resulted in a more accurate estimation of snow cover than a negative threshold that detects wet snow alone. We find that TSX X-band data in VH polarization performs at a quality comparable to Landsat 8 in deriving SCE maps when a positive threshold is used. We conclude that TSX data can be used to accurately monitor snowmelt events at high temporal and spatial resolution, overcoming the limitations of Landsat 8, which, due to cloud-related data gaps, generally indicated only the onset and end of snowmelt.
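The ratio-image thresholding described above can be sketched in a few lines of NumPy (toy array values; the real TSX processing chain involves calibration, geocoding, and speckle filtering that are not shown here):

```python
import numpy as np

def snow_cover_mask(wet_img, dry_ref, threshold_db=1.0):
    """Threshold the dB ratio of a melt-season acquisition against a
    dry-snow reference image (backscatter in linear power units).

    A positive threshold flags pixels whose backscatter *increased*
    relative to the dry reference (e.g. refrozen ice layers), as in
    the study's +1 dB variant; a negative threshold would instead
    flag the backscatter drop caused by wet snow alone
    (use `ratio_db < threshold_db` in that case).
    """
    ratio_db = 10 * np.log10(wet_img / dry_ref)
    return ratio_db > threshold_db

# Toy 2x2 scene: backscatter doubled (+3 dB) in one pixel only.
dry = np.ones((2, 2))
wet = np.array([[2.0, 1.0],
                [1.0, 1.0]])
mask = snow_cover_mask(wet, dry, threshold_db=1.0)
fsc = mask.mean()  # fractional snow cover of the scene
print(int(mask.sum()), fsc)  # 1 0.25
```

Fractional snow cover then falls out directly as the mean of the binary mask over the catchment, which is the quantity compared against the in situ time-lapse observations.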
Mumbling through a wall: Clustering Slavic dialects using hierarchical statistical modeling of prosody
Non peer reviewed.
Bulletin der deutschen Slavistik 22.2016
Exploring data provenance in handwritten text recognition infrastructure: Sharing and reusing ground truth data, referencing models, and acknowledging contributions. Starting the conversation on how we could get it done
This paper discusses best practices for sharing and reusing Ground Truth in Handwritten Text Recognition infrastructures, and ways to reference and acknowledge contributions to the creation and enrichment of data within these Machine Learning systems. We discuss how one can publish Ground Truth data in a repository and, subsequently, inform others. Furthermore, we suggest appropriate citation methods for HTR data, models, and contributions made by volunteers. Moreover, when using digitised sources (digital facsimiles), it becomes increasingly important to distinguish between the physical object and the digital collection. These topics all relate to the proper acknowledgement of labour put into digitising, transcribing, and sharing Ground Truth HTR data. This also points to broader issues surrounding the use of Machine Learning in archival and library contexts, and how the community should begin to acknowledge and record both contributions and data provenance.