On-the-fly Historical Handwritten Text Annotation
The performance of information retrieval algorithms depends upon the
availability of ground truth labels annotated by experts. This is an important
prerequisite, and difficulties arise when the annotated ground truth labels are
incorrect or incomplete due to high levels of degradation. To address this
problem, this paper presents a simple method to perform on-the-fly annotation
of degraded historical handwritten text in ancient manuscripts. The proposed
method aims at quick generation of ground truth and correction of inaccurate
annotations such that the bounding box perfectly encapsulates the word, and
contains no added noise from the background or surroundings. This method will
potentially be of help to historians and researchers in generating and
correcting word labels in a document dynamically. The effectiveness of the
annotation method is empirically evaluated on an archival manuscript collection
from well-known publicly available datasets.
Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts
Historical palm-leaf manuscripts and early paper documents from the Indian
subcontinent form an important part of the world's literary and cultural
heritage. Despite their importance, large-scale annotated Indic manuscript
image datasets do not exist. To address this deficiency, we introduce
Indiscapes, the first ever dataset with multi-regional layout annotations for
historical Indic manuscripts. To address the challenge of large diversity in
scripts and presence of dense, irregular layout elements (e.g. text lines,
pictures, multiple documents per image), we adapt a Fully Convolutional Deep
Neural Network architecture for fully automatic, instance-level spatial layout
parsing of manuscript images. We demonstrate the effectiveness of the proposed
architecture on images from the Indiscapes dataset. For annotation flexibility
and keeping the non-technical nature of domain experts in mind, we also
contribute a custom, web-based GUI annotation tool and a dashboard-style
analytics portal. Overall, our contributions set the stage for enabling
downstream applications such as OCR and word-spotting in historical Indic
manuscripts at scale.
Comment: Oral presentation at International Conference on Document Analysis
and Recognition (ICDAR) - 2019. For dataset, pre-trained networks and
additional details, visit project page at http://ihdia.iiit.ac.in
Text Line Segmentation of Historical Documents: a Survey
There are huge numbers of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
Digital Palaeography
This article seeks to explore new digital ways of distinguishing between scribal hands in medieval manuscripts. An analysis of traditional palaeographical approaches to hand identification will be followed by a discussion in which attention will be paid both to the use of computer software to enhance existing methods of scribal identification, and to the benefits of "Quill", an innovative automatic writer identification tool. A case study involving a manuscript of the collected works of Christine de Pizan (London, British Library, Harley 4431) will serve to demonstrate that traditional palaeographical methods of analysing scribal hands can greatly benefit from the use of specialised computer software
Automatic Palaeographic Exploration of Genizah Manuscripts
The Cairo Genizah is a collection of hand-written documents containing approximately
350,000 fragments of mainly Jewish texts discovered in the late 19th century. The
fragments are today spread out in some 75 libraries and private collections worldwide,
but there is an ongoing effort to document and catalogue all extant fragments.
Palaeographic information plays a key role in the study of the Genizah collection.
Script style, and more specifically handwriting, can be used to identify fragments that
might originate from the same original work. Such matched fragments, commonly
referred to as "joins", are currently identified manually by experts, and presumably only
a small fraction of existing joins have been discovered to date. In this work, we show
that automatic handwriting matching functions, obtained from non-specific features
using a corpus of writing samples, can perform this task quite reliably. In addition, we
explore the problem of grouping various Genizah documents by script style, without
being provided any prior information about the relevant styles. The automatically
obtained grouping agrees, for the most part, with the palaeographic taxonomy. In cases
where the method fails, it is due to apparent similarities between related scripts.
Sharing Cultural Heritage: the Clavius on the Web Project
In the last few years the number of manuscripts digitized and made available on the Web has been constantly increasing. However, there is still a considerable lack of results concerning both making their content explicit and the tools developed to make it available. The objective of the Clavius on the Web project is to develop a Web platform exposing a selection of Christophorus Clavius letters along with three different levels of analysis: linguistic, lexical and semantic. The multilayered annotation of the corpus involves an XML-TEI encoding followed by a tokenization step where each token is uniquely identified through a CTS URN notation and then associated with a part-of-speech and a lemma. The text is lexically and semantically annotated on the basis of a lexicon and a domain ontology, the former structuring the most relevant terms occurring in the text and the latter representing the domain entities of interest (e.g. people, places, etc.). Moreover, each entity is connected to linked and non-linked resources, including DBpedia and VIAF. Finally, the results of the three layers of analysis are gathered and shown through interactive visualization and storytelling techniques. A demo version of the integrated architecture was developed.
Persian Heritage Image Binarization Competition (PHIBC 2012)
The first competition on the binarization of historical Persian documents and
manuscripts (PHIBC 2012) has been organized in conjunction with the first
Iranian conference on pattern recognition and image analysis (PRIA 2013). The
main objective of PHIBC 2012 is to evaluate the performance of binarization
methodologies when applied to Persian heritage images. This paper provides
a report on the methodology and performance of the three submitted algorithms
based on the evaluation measures used.
Comment: 4 pages, 2 figures, conferenc
Detecting Authorship, Hands, and Corrections in Historical Manuscripts. A Mixed-methods Approach towards the Unpublished Writings of an 18th Century Czech Emigré Community in Berlin (Handwriting)
When one starts working philologically with historical manuscripts, one faces important first questions involving authorship, writers' hands and the history of document transmission. These issues are especially thorny with documents remaining outside the established canon, such as private manuscripts, about which we have very restricted text-external information. In this area, so we argue, it is especially fruitful to employ a mixed-methods approach, combining tailored automatic methods from image recognition/analysis with philological and linguistic knowledge. While image analysis captures writers' hands, linguistic/philological research mainly addresses textual authorship; the two cross-fertilize and obtain a coherent interpretation which may then be evaluated against the available text-external historical evidence. Departing from our "lab case", which is a corpus of unedited Czech manuscripts from the archive of a small 18th century migrant community, the Herrnhuter Brüdergemeine (Brethren parish) in Berlin-Neukölln, our project has developed an assistance system which aids philologists in working with digitized (scanned) hand-written historical sources. We present its application and discuss its general potential and methodological implications.