5 research outputs found
Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts
Historical palm-leaf manuscript and early paper documents from Indian
subcontinent form an important part of the world's literary and cultural
heritage. Despite their importance, large-scale annotated Indic manuscript
image datasets do not exist. To address this deficiency, we introduce
Indiscapes, the first ever dataset with multi-regional layout annotations for
historical Indic manuscripts. To address the challenge of large diversity in
scripts and presence of dense, irregular layout elements (e.g. text lines,
pictures, multiple documents per image), we adapt a Fully Convolutional Deep
Neural Network architecture for fully automatic, instance-level spatial layout
parsing of manuscript images. We demonstrate the effectiveness of proposed
architecture on images from the Indiscapes dataset. For annotation flexibility
and keeping the non-technical nature of domain experts in mind, we also
contribute a custom, web-based GUI annotation tool and a dashboard-style
analytics portal. Overall, our contributions set the stage for enabling
downstream applications such as OCR and word-spotting in historical Indic
manuscripts at scale.Comment: Oral presentation at International Conference on Document Analysis
and Recognition (ICDAR) - 2019. For dataset, pre-trained networks and
additional details, visit project page at http://ihdia.iiit.ac.in
A Scheme Towards Automatic Word Indexation System for Balinese Palm Leaf Manuscripts
This paper proposes an initial scheme towards the development of an automatic word indexation system for Balinese lontar (palm leaf manuscript) collections. The word indexation system scheme consists of a sub module for patch image extraction of text areas in lontars and a sub module for word image transliteration. This is the first word indexation system for lontar collections to be proposed. To detect parts of a lontar image that contain text, a Gabor filter is used to provide initial information about the presence of text texture in the image. An adaptive sliding patch algorithm for the extraction of patch images in lontars is also proposed. The word image transliteration sub module was built using the long short-term memory (LSTM) model. The results showed that the image patch extraction of text areas process succeeded in optimally detecting text areas in lontars and extracting the patch image in a suitable position. The proposed scheme successfully extracted between 20% to 40% of the keywords in lontars and thus can at least provide an initial description for prospective lontar readers of the content contained in a lontar collection or to find in which lontar collection certain keywords can be found
Binarization strategy using multiple convolutional autoencoder network for old Sundanese manuscript images
Computer Systems, Imagery and Medi
Towards robust real-world historical handwriting recognition
In this thesis, we make a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie that explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as an extremely diverse collections of (pixel) images. Despite the great results, current DL methods are very data greedy, time consuming, heavily dependent on the human expert from the humanities for labeling and require machine-learning experts for designing the models. Ideally, the use of deep learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse amount of labeled data. We present several approaches towards dealing with these problems, aiming to improve the robustness of current methods and to improve the autonomy in training. We applied our novel word and line text recognition approaches on nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, a level of accuracy was achieved which required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need of labeled data
ICFHR 2018 Competition On Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts
This paper presents the results of the Competition on Document Image Analysis Tasks for Southeast Asian Palm Leaf Manuscripts that was organized in the context of the 16th International Conference on Frontiers in Handwriting Recognition (ICFHR-2018). For this competition, three different corpus of palm leaf manuscripts written in three different scripts and languages (Balinese, Sundanese and Khmer) are used. Four Document Image Analysis (DIA) tasks are proposed as the challenges in this competition: binarization, text line segmentation, isolated character/glyph recognition, and word transliteration. The results of this competition will be very useful in benchmarking analysis for the collection of palm leaf manuscripts, accelerating, evaluating and improving the performance of existing DIA system for a new type of document collection. This paper describes the competition details including the dataset, the evaluation measures used, a short description of each participant as well as the performance of the all submitted methods