29,809 research outputs found

    On-the-fly Historical Handwritten Text Annotation

    Full text link
    The performance of information retrieval algorithms depends upon the availability of ground truth labels annotated by experts. This is an important prerequisite, and difficulties arise when the annotated ground truth labels are incorrect or incomplete due to high levels of degradation. To address this problem, this paper presents a simple method to perform on-the-fly annotation of degraded historical handwritten text in ancient manuscripts. The proposed method aims at quick generation of ground truth and correction of inaccurate annotations such that the bounding box perfectly encapsulates the word, and contains no added noise from the background or surroundings. This method will potentially be of help to historians and researchers in generating and correcting word labels in a document dynamically. The effectiveness of the annotation method is empirically evaluated on an archival manuscript collection from well-known publicly available datasets

    Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts

    Full text link
    Historical palm-leaf manuscript and early paper documents from Indian subcontinent form an important part of the world's literary and cultural heritage. Despite their importance, large-scale annotated Indic manuscript image datasets do not exist. To address this deficiency, we introduce Indiscapes, the first ever dataset with multi-regional layout annotations for historical Indic manuscripts. To address the challenge of large diversity in scripts and presence of dense, irregular layout elements (e.g. text lines, pictures, multiple documents per image), we adapt a Fully Convolutional Deep Neural Network architecture for fully automatic, instance-level spatial layout parsing of manuscript images. We demonstrate the effectiveness of proposed architecture on images from the Indiscapes dataset. For annotation flexibility and keeping the non-technical nature of domain experts in mind, we also contribute a custom, web-based GUI annotation tool and a dashboard-style analytics portal. Overall, our contributions set the stage for enabling downstream applications such as OCR and word-spotting in historical Indic manuscripts at scale.Comment: Oral presentation at International Conference on Document Analysis and Recognition (ICDAR) - 2019. For dataset, pre-trained networks and additional details, visit project page at http://ihdia.iiit.ac.in

    Text Line Segmentation of Historical Documents: a Survey

    Full text link
    There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines),automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest.Comment: 25 pages, submitted version, To appear in International Journal on Document Analysis and Recognition, On line version available at http://www.springerlink.com/content/k2813176280456k3

    Digital Palaeography

    Get PDF
    This article seeks to explore new digital ways of distinguishing between scribal hands in medieval manuscripts. An analysis of traditional palaeographical approaches to hand identification will be followed by a discussion in which attention will be paid both to the use of computer software to enhance existing methods of scribal identification, and to the benefits of "Quill", an innovative automatic writer identification tool. A case study involving a manuscript of the collected works of Christine de Pizan (London, British Library, Harley 4431) will serve to demonstrate that traditional palaeographical methods of analysing scribal hands can greatly benefit from the use of specialised computer software

    Automatic Palaeographic Exploration of Genizah Manuscripts

    Get PDF
    The Cairo Genizah is a collection of hand-written documents containing approximately 350,000 fragments of mainly Jewish texts discovered in the late 19th century. The fragments are today spread out in some 75 libraries and private collections worldwide, but there is an ongoing effort to document and catalogue all extant fragments. Palaeographic information plays a key role in the study of the Genizah collection. Script style, and–more specifically–handwriting, can be used to identify fragments that might originate from the same original work. Such matched fragments, commonly referred to as “joins”, are currently identified manually by experts, and presumably only a small fraction of existing joins have been discovered to date. In this work, we show that automatic handwriting matching functions, obtained from non-specific features using a corpus of writing samples, can perform this task quite reliably. In addition, we explore the problem of grouping various Genizah documents by script style, without being provided any prior information about the relevant styles. The automatically obtained grouping agrees, for the most part, with the palaeographic taxonomy. In cases where the method fails, it is due to apparent similarities between related scripts

    Sharing Cultural Heritage: the Clavius on the Web Project

    Get PDF
    In the last few years the amount of manuscripts digitized and made available on the Web has been constantly increasing. However, there is still a considarable lack of results concerning both the explicitation of their content and the tools developed to make it available. The objective of the Clavius on the Web project is to develop a Web platform exposing a selection of Christophorus Clavius letters along with three different levels of analysis: linguistic, lexical and semantic. The multilayered annotation of the corpus involves a XML-TEI encoding followed by a tokenization step where each token is univocally identified through a CTS urn notation and then associated to a part-of-speech and a lemma. The text is lexically and semantically annotated on the basis of a lexicon and a domain ontology, the former structuring the most relevant terms occurring in the text and the latter representing the domain entities of interest (e.g. people, places, etc.). Moreover, each entity is connected to linked and non linked resources, including DBpedia and VIAF. Finally, the results of the three layers of analysis are gathered and shown through interactive visualization and storytelling techniques. A demo version of the integrated architecture was developed

    Persian Heritage Image Binarization Competition (PHIBC 2012)

    Full text link
    The first competition on the binarization of historical Persian documents and manuscripts (PHIBC 2012) has been organized in conjunction with the first Iranian conference on pattern recognition and image analysis (PRIA 2013). The main objective of PHIBC 2012 is to evaluate performance of the binarization methodologies, when applied on the Persian heritage images. This paper provides a report on the methodology and performance of the three submitted algorithms based on evaluation measures has been used.Comment: 4 pages, 2 figures, conferenc

    Detecting Authorship, Hands, and Corrections in Historical Manuscripts. A Mixedmethods Approach towards the Unpublished Writings of an 18th Century Czech Emigré Community in Berlin (Handwriting)

    Full text link
    When one starts working philologically with historical manuscripts, one faces important first questions involving authorship, writers’ hands andthe history of documenttransmission. These issues are especially thorny with documents remaining outside the established canon, such as privatemanuscripts, aboutwhichwehave very restrictedtext-externalinformation. In this area – so we argue – it is especially fruitful to employ a mixed-methods approach, combiningtailored automatic methods from image recognition/analysis with philological and linguistic knowledge.Whileimage analysis captureswriters’ hands, linguistic/philological research mainly addressestextual authorship;thetwo cross-fertilize and obtain a coherent interpretation which may then be evaluated against the available text-external historical evidence. Departingfrom our ‘lab case’,whichis a corpus of unedited Czechmanuscriptsfromthe archive of a small 18th century migrant community, the Herrnhuter BrĂŒdergemeine (Brethren parish) in Berlin-Neukölln, our project has developed an assistance system which aids philologists in working with digitized (scanned) hand-written historical sources. We present its application and discuss its general potential and methodological implications
    • 

    corecore