    Arabic Manuscript Layout Analysis and Classification

    Image Enhancement Background for High Damage Malay Manuscripts using Adaptive Threshold Binarization

    The handwritten Jawi manuscripts kept at the Malaysia National Library (MNL) have aged over decades. Despite the intensive preservation process conducted by MNL, these manuscripts remain in poor condition and can be neither read easily nor viewed clearly. Even though many state-of-the-art methods have been developed for image enhancement, none of them can handle manuscripts of extremely poor quality. The degradation of old Malay manuscripts can be categorized into three types, namely: uneven image background, image effects, and image effects with expanded patches. The aim of this paper is to discuss the methods used to improve the quality of the manuscript images. Our proposed approach consists of several main methods: Local Adaptive Equalization, Image Intensity Values, Automatic Threshold PP, and Adaptive Threshold Filtering. The paper aims to produce a clearer image that is easier to read. On the Error Bit Phase measure (TKB), the proposed method (Adaptive Threshold Filtering Process / PAM) achieves a smaller error value, namely 0.0316, than Otsu’s Threshold Method (MNAO), the Binary Threshold Value Method (MNAP), and the Automatic Local Threshold Value Method (MNATA). On ink-bleed images, the precision of the proposed method exceeds 95%, compared with 75.82%, 90.68%, and 91.2% for the state-of-the-art methods MNAO, MNAP, and MNATA respectively. However, the paper also reports, for the ink-bleed case, figures of 45.74%, 54.80%, 53.23%, and 46.02% for PAM, MNAO, MNAP, and MNATA respectively. In conclusion, the proposed method produces better character shapes than the other methods.
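To make the contrast between a global threshold and a local one concrete, here is a minimal pure-Python sketch. It is not the paper's PAM pipeline: `otsu_threshold` is the textbook Otsu criterion (the kind of baseline the abstract calls MNAO), and `local_mean_binarize` is a generic local-mean threshold. The function names and the `radius` and `offset` parameters are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t that maximises between-class variance.

    `pixels` is a flat list of integer grey levels in 0..255.
    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the darker class so far
    sum0 = 0  # sum of grey levels in the darker class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def local_mean_binarize(img, radius=1, offset=10):
    """Mark a pixel as ink (1) if it is darker than its local mean by `offset`.

    `img` is a list of rows of grey levels. `radius` and `offset` are
    illustrative parameters, not values taken from the paper.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            vals = [img[j][i] for j in ys for i in xs]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if img[y][x] < mean - offset else 0
    return out
```

A single global threshold tends to fail when the background is uneven, because no one grey level separates ink from paper everywhere. The local variant compares each pixel only with its own neighbourhood, at the cost of two extra parameters to tune.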

    The 3rd International Conference on African Digital Libraries and Archives Digital Libraries and Archives in Africa: Changing Lives and Building Communities

    The corpus of Moroccan manuscripts is estimated at more than 80,000 titles and 200,000 volumes held at a number of public and private libraries, mostly religious institutions and zawāyā. These collections are invaluable both as repositories of human knowledge and memory and for their aesthetic value in terms of calligraphy, illumination, iconography and craftsmanship. Several medieval authors position Morocco as an important center in the Muslim West (al-Gharb al-Islami) for manuscript production, illumination, binding and exchange. However, except for a few scattered publications, a history of North African Arabic calligraphy (al-khatt al-maghribi) remains to be written. By providing the tools for making these collections readily accessible to the scholarly community in the Maghrib and beyond, ICT will make possible the study of North African scripts within the broader context of Arabic calligraphy and the Islamic arts of the book in general. The two main manuscript collections in Morocco are hosted at the National Library of Morocco (Bibliothèque nationale du royaume du Maroc, or BNRM, formerly General Library and Archive) in Rabat (12,140 titles), and the Qarawiyyin Library in Fez (5,600 titles, 3,157 of which are in several volumes). These collections originated mostly from waqf (pious endowments) and state appropriation of private collections (e.g., 1,311 and 3,371 titles from the al-Glawi and al-Kattani collections respectively). They are written almost entirely in Arabic and in various scripts; Amazigh (Berber) manuscripts in Arabic script and Hebrew manuscripts constitute less than one percent of the total collections.

    Arabic Manuscripts Analysis and Retrieval

    Analyse d’images de documents patrimoniaux : une approche structurelle à base de texture

    Over the last few years, there has been tremendous growth in digitizing collections of cultural heritage documents. Thus, many challenges and open issues have been raised, such as information retrieval in digital libraries or analyzing page content of historical books. Recently, an important need has emerged which consists in designing a computer-aided characterization and categorization tool, able to index or group historical digitized book pages according to several criteria, mainly the layout structure and/or typographic/graphical characteristics of the historical document image content. Thus, the work conducted in this thesis presents an automatic approach for characterization and categorization of historical book pages. The proposed approach is applicable to a large variety of ancient books. In addition, it does not assume a priori knowledge regarding document image layout and content. It is based on the use of texture and graph algorithms to provide a rich and holistic description of the layout and content of the analyzed book pages to characterize and categorize historical book pages. The categorization is based on the characterization of the digitized page content by texture, shape, geometric and topological descriptors. This characterization is represented by a structural signature. More precisely, the signature-based characterization approach consists of two main stages. The first stage is extracting homogeneous regions. Then, the second one is proposing a graph-based page signature which is based on the extracted homogeneous regions, reflecting its layout and content. Afterwards, by comparing the different obtained graph-based signatures using a graph-matching paradigm, the similarities of digitized historical book page layout and/or content can be deduced. Subsequently, book pages with similar layout and/or content can be categorized and grouped, and a table of contents/summary of the analyzed digitized historical book can be provided automatically. 
As a consequence, numerous signature-based applications (e.g. information retrieval in digital libraries according to several criteria, page categorization) can be implemented to manage a corpus or collections of books effectively. To illustrate the effectiveness of the proposed page signature, a detailed experimental evaluation has been conducted in this work to assess two possible categorization applications: unsupervised page classification and page stream segmentation. In addition, the different steps of the proposed approach have been evaluated on a large variety of historical document images.
Recent progress in digitizing collections of heritage documents has raised new challenges in ensuring sustainable preservation and providing wider access to ancient documents. Alongside information retrieval in digital libraries and content analysis of digitized pages of ancient books, the characterization and categorization of ancient book pages has recently seen renewed interest. Efforts focus on developing fast, automatic tools for characterizing and categorizing ancient book pages, able to classify the pages of a digitized book according to several criteria, notably the layout structure and/or the typographic/graphical characteristics of the page content. Thus, in this thesis, we propose an approach for the automatic characterization and categorization of the pages of an ancient book. The proposed approach is intended to be independent of the structure and content of the analyzed book. The main advantage of this work is that the approach requires no prior knowledge of either the document's content or its structure.
It is based on an analysis of texture descriptors and a structural graph representation in order to provide a rich description enabling categorization from the graphical content (captured by texture) and the layouts (represented by graphs). This categorization relies on characterizing the content of the digitized page through an analysis of texture, shape, geometric and topological descriptors. This characterization is defined by means of a structural representation. In detail, the categorization approach consists of two main successive stages. The first extracts homogeneous regions. The second proposes a texture-based structural signature, in the form of a graph, built from the extracted homogeneous regions and reflecting the structure of the analyzed page. This signature enables numerous applications for effectively managing a corpus or collections of heritage books (for example, information retrieval in digital libraries according to several criteria, or categorization of the pages of a single book). By comparing the different structural signatures by means of the graph edit distance, the similarities between the pages of a single book in terms of layout and/or content can be deduced. Pages with similar layouts and/or content can then be categorized, and a summary/table of contents of the analyzed book can be generated automatically. To illustrate the effectiveness of the proposed signature, a detailed experimental study was carried out in this work to evaluate two possible page categorization applications within a single book: unsupervised page classification and page stream segmentation.
In addition, the different steps of the proposed approach were evaluated through experiments on a large corpus of heritage documents.
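The idea of comparing pages through graph-based signatures can be illustrated with a deliberately simplified sketch. This is not the thesis's signature or its graph-matching procedure: here a page is a hypothetical graph whose nodes are homogeneous regions carrying a label (such as "text" or "graphic") and whose edges link adjacent regions, and the dissimilarity is a crude count of mismatched node and edge labels that stands in for a real graph-matching cost. All names are assumptions made for illustration.

```python
from collections import Counter


def signature(nodes, edges):
    """Summarise a region graph as two multisets (bags).

    nodes: dict mapping region id -> label, e.g. {0: "text", 1: "graphic"}
    edges: iterable of (id_a, id_b) pairs for adjacent regions
    """
    node_bag = Counter(nodes.values())
    # An edge is described by the unordered pair of its endpoint labels.
    edge_bag = Counter(tuple(sorted((nodes[a], nodes[b]))) for a, b in edges)
    return node_bag, edge_bag


def bag_distance(a, b):
    """Number of items that must be added or removed to turn bag a into bag b."""
    return sum(((a - b) + (b - a)).values())


def signature_distance(sig_a, sig_b):
    """Crude page dissimilarity: mismatched node labels plus mismatched edges."""
    return bag_distance(sig_a[0], sig_b[0]) + bag_distance(sig_a[1], sig_b[1])


# Hypothetical pages: two with text, text and a graphic; one with text only.
page1 = signature({0: "text", 1: "text", 2: "graphic"}, [(0, 1), (1, 2)])
page2 = signature({0: "text", 1: "graphic", 2: "text"}, [(0, 2), (0, 1)])
page3 = signature({0: "text", 1: "text"}, [(0, 1)])
```

Pages whose pairwise distance falls below a chosen cut-off could then be grouped together. A real system would use a proper graph-matching or graph edit distance computation, which also accounts for how the regions are arranged, not just which labels occur.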

    An intelligent framework for pre-processing ancient Thai manuscripts on palm leaves

    In Thailand’s early history, prior to the availability of paper and printing technologies, palm leaves were used to record information written by hand. These ancient documents contain invaluable knowledge. By digitising the manuscripts, the content can be preserved and made widely available to the interested community via electronic media. However, the content is difficult to access or retrieve. In order to extract relevant information from the document images efficiently, each step of the process requires reduction of irrelevant data such as noise or interference on the images. The pre-processing techniques serve the purpose of extracting regions of interest, reducing noise from the image and degrading the irrelevant background. The image can then be directly and efficiently processed for feature selection and extraction prior to the subsequent phase of character recognition. It is therefore the main objective of this study to develop an efficient and intelligent image preprocessing system that could be used to extract components from ancient manuscripts for information extraction and retrieval purposes. The main contributions of this thesis are the provision and enhancement of the region of interest by using an intelligent approach for the pre-processing of ancient Thai manuscripts on palm leaves and a detailed examination of the preprocessing techniques for palm leaf manuscripts. As noise reduction and binarisation are involved in the first step of pre-processing to eliminate noise and background from image documents, it is necessary for this step to provide a good quality output; otherwise, the accuracy of the subsequent stages will be affected. In this work, an intelligent approach to eliminate background was proposed and carried out by a selection of appropriate binarisation techniques using SVM. 
As there could be multiple binarisation techniques of choice, another approach was proposed to eliminate the background in this study in order to generate an optimal binarised image. The proposal is an ensemble architecture based on the majority vote scheme utilising local neighbouring information around a pixel of interest. To extract text from that binarised image, line segmentation was then applied based on the partial projection method as this method provides good results with slanted text and connected components. To improve the quality of the partial projection method, an Adaptive Partial Projection (APP) method was proposed. This technique adjusts the size of a character strip automatically by adapting the width of the strip to separate the connected component of consecutive lines through divide and conquer, and analysing the upper vowels and lower vowels of the text line. Finally, character segmentation was proposed using a hierarchical segmentation technique based on a contour-tracing algorithm. Touching components identified from the previous step were then separated by a trace of the background skeletons, and a combined method of segmentation. The key datasets used in this study are images provided by the Project for Palm Leaf Preservation, Northeastern Thailand Division; benchmark datasets from the Document Image Binarisation Contest (DIBCO) series are used to compare the results of this work against other binarisation techniques. The experimental results have shown that the proposed methods in this study provide superior performance and will be used to support subsequent processing of ancient Thai palm leaf documents. It is expected that the contributions from this study will also benefit research work on ancient manuscripts in other languages.
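The majority-vote ensemble mentioned above can be sketched as follows. The abstract does not spell out how the local neighbourhood enters the vote, so this sketch makes an assumption: each pixel takes the label most methods agree on, and only when the vote is tied are the votes pooled over the surrounding window. The function name and the `radius` parameter are hypothetical.

```python
def majority_vote(binarised, radius=1):
    """Combine several binarised images (1 = ink, 0 = background).

    binarised: list of same-sized images, each a list of rows of 0/1,
               one image per binarisation technique.
    Ties at a pixel are broken by pooling votes over its neighbourhood
    (an assumption of this sketch, not a detail from the thesis).
    """
    n = len(binarised)
    h, w = len(binarised[0]), len(binarised[0][0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            votes = sum(img[y][x] for img in binarised)
            if 2 * votes != n:
                # Clear majority at this pixel.
                out[y][x] = 1 if 2 * votes > n else 0
                continue
            # Tie: count ink votes from all methods in the local window.
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            pooled = sum(img[j][i] for img in binarised for j in ys for i in xs)
            cells = n * len(ys) * len(xs)
            out[y][x] = 1 if 2 * pooled > cells else 0
    return out
```

With an odd number of methods no tie can occur and the result is a plain per-pixel majority. The neighbourhood fallback only matters for an even number of methods, where it favours the label that dominates nearby.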