5 research outputs found

Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts

Author: Aitha Sowmya
Prusty Abhishek
Sarvadevabhatla Ravi Kiran
Trivedi Abhishek
Publication date: 15/12/2019

Historical palm-leaf manuscripts and early paper documents from the Indian subcontinent form an important part of the world's literary and cultural heritage. Despite their importance, large-scale annotated Indic manuscript image datasets do not exist. To address this deficiency, we introduce Indiscapes, the first ever dataset with multi-regional layout annotations for historical Indic manuscripts. To address the challenge of large diversity in scripts and the presence of dense, irregular layout elements (e.g. text lines, pictures, multiple documents per image), we adapt a Fully Convolutional Deep Neural Network architecture for fully automatic, instance-level spatial layout parsing of manuscript images. We demonstrate the effectiveness of the proposed architecture on images from the Indiscapes dataset. For annotation flexibility, and keeping the non-technical background of domain experts in mind, we also contribute a custom, web-based GUI annotation tool and a dashboard-style analytics portal. Overall, our contributions set the stage for enabling downstream applications such as OCR and word-spotting in historical Indic manuscripts at scale.

Comment: Oral presentation at International Conference on Document Analysis and Recognition (ICDAR) - 2019. For dataset, pre-trained networks and additional details, visit project page at http://ihdia.iiit.ac.in

arXiv.org e-Print Archive

Crossref

A Scheme Towards Automatic Word Indexation System for Balinese Palm Leaf Manuscripts

Author: Kesiman Made Windu Antara
Pradnyana Gede Aditra
Publication venue: LPPM ITBis Lembah Dempo
Publication date: 01/10/2021

This paper proposes an initial scheme towards the development of an automatic word indexation system for Balinese lontar (palm leaf manuscript) collections. The scheme consists of a sub-module for extracting patch images of text areas in lontars and a sub-module for word image transliteration. This is the first word indexation system for lontar collections to be proposed. To detect the parts of a lontar image that contain text, a Gabor filter is used to provide initial information about the presence of text texture in the image. An adaptive sliding patch algorithm for the extraction of patch images in lontars is also proposed. The word image transliteration sub-module was built using the long short-term memory (LSTM) model. The results showed that the patch image extraction process succeeded in optimally detecting text areas in lontars and extracting each patch image at a suitable position. The proposed scheme successfully extracted between 20% and 40% of the keywords in lontars and thus can at least provide prospective lontar readers with an initial description of the content of a lontar collection, or help them find which lontar collection contains certain keywords.

Journal of ICT Research and Applications

Directory of Open Access Journals

ITB Journal

Binarization strategy using multiple convolutional autoencoder network for old Sundanese manuscript images

Author: Burie J-C.
Paulus E.
Verbeek F.J.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/09/2021

Computer Systems, Imagery and Medi

Leiden University Scholarly Publications

Towards robust real-world historical handwriting recognition

Author: Ameryan Mahya
Publication venue: University of Groningen
Publication date: 01/01/2023

In this thesis, we build a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie, which explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as extremely diverse collections of (pixel) images. Despite their great results, current DL methods are very data-greedy and time-consuming, depend heavily on human experts from the humanities for labeling, and require machine-learning experts to design the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse labeled data. We present several approaches towards dealing with these problems, aiming to improve the robustness of current methods and to improve the autonomy in training. We applied our novel word and line text recognition approaches to nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, a level of accuracy was achieved which required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need for labeled data.

Dissertations of the University of Groningen

ICFHR 2018 Competition On Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts

Author: Burie Jean-Christophe
Chhun Sophea
Hadi Setiawan
Kesiman Made Windu Antara
Ogier Jean-Marc
Paulus Erick
Suryani Mira
Valy Dona
Verleysen Michel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018

This paper presents the results of the Competition on Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts that was organized in the context of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR-2018). For this competition, three different corpora of palm leaf manuscripts written in three different scripts and languages (Balinese, Sundanese and Khmer) are used. Four Document Image Analysis (DIA) tasks are proposed as the challenges in this competition: binarization, text line segmentation, isolated character/glyph recognition, and word transliteration. The results of this competition will be very useful for benchmarking analysis of palm leaf manuscript collections, and for accelerating, evaluating and improving the performance of existing DIA systems on a new type of document collection. This paper describes the competition details, including the dataset, the evaluation measures used, a short description of each participant, and the performance of all the submitted methods.

Crossref

DIAL UCLouvain

HAL-Paris1