Text Line Segmentation of Historical Documents: a Survey
There is a huge amount of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
A Bottom Up Procedure for Text Line Segmentation of Latin Script
In this paper we present a bottom-up procedure for segmentation of text lines
written or printed in the Latin script. The proposed method uses a combination
of image morphology, feature extraction and a Gaussian mixture model to perform
this task. The experimental results show the validity of the procedure.
Comment: Accepted and presented at the IEEE conference "International
Conference on Advances in Computing, Communications and Informatics (ICACCI)
2017"
Text lines and snippets extraction for 19th century handwriting documents layout analysis
In this paper we propose a new approach to improve electronic editions of human science corpora by providing an efficient estimation of the structure of manuscript pages. In any handwritten document analysis process, text line segmentation is an important stage. Variable inter-line spacing, inconsistent baseline skew, overlaps and occlusions in unconstrained 19th-century handwritten documents complicate the text line segmentation task. The only prior knowledge of the script that we use is the fact that text line skews can be random and irregular. In that context, we model text line detection as an image segmentation problem, enhancing text line structure with the Hough transform and a clustering of connected components so as to make text line boundaries appear. The proposed approach of snippet decomposition for page layout analysis relies on a first step that classifies content pages into five visual and genetic taxonomies, and a second step of text line extraction and snippet decomposition. Experiments show that the proposed method achieves high accuracy for detecting text lines in regular and semi-regular handwritten pages in the corpus of digitized Flaubert manuscripts ("Dossiers documentaires de Bouvard et Pécuchet", 1872-1880).
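The idea of enhancing text line structure with a Hough transform over connected components can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each connected component has already been reduced to a centroid `(x, y)`, restricts the search to near-horizontal lines, and uses hypothetical names and parameters (`hough_votes`, `extract_text_lines`, `max_skew_deg`, `rho_step`, `min_votes`) chosen only for illustration.

```python
import math
from collections import defaultdict


def hough_votes(points, theta_step_deg=1.0, rho_step=5.0, max_skew_deg=10.0):
    """Accumulate Hough votes for near-horizontal lines.

    Each accumulator cell (skew index, rho index) stores the indices of the
    centroids that voted for it. All parameters are illustrative assumptions.
    """
    votes = defaultdict(list)
    n_steps = int(round(2 * max_skew_deg / theta_step_deg)) + 1
    for idx, (x, y) in enumerate(points):
        for k in range(n_steps):
            skew = -max_skew_deg + k * theta_step_deg
            # Normal form x*cos(theta) + y*sin(theta) = rho, where theta is
            # the angle of the line's normal (90 degrees plus the skew).
            theta = math.radians(90.0 + skew)
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(k, round(rho / rho_step))].append(idx)
    return votes


def extract_text_lines(points, min_votes=3, **hough_params):
    """Greedily pick the strongest cells and assign their centroids to lines."""
    votes = hough_votes(points, **hough_params)
    assigned = set()
    lines = []
    while True:
        # Strongest cell, counting only centroids not yet assigned to a line.
        best = max(
            votes.values(),
            key=lambda members: sum(1 for i in members if i not in assigned),
            default=[],
        )
        members = [i for i in best if i not in assigned]
        if len(members) < min_votes:
            break
        lines.append(sorted(members))
        assigned.update(members)
    return lines


# Two synthetic lines: one horizontal, one with a slight upward skew.
centroids = [(0, 100), (50, 100), (100, 100), (150, 100),
             (0, 200), (50, 202), (100, 204), (150, 206)]
text_lines = extract_text_lines(centroids)
```

Each entry of `text_lines` is a list of centroid indices judged to lie on the same line. A real pipeline would still need the component extraction and the clustering step that the abstract mentions, both of which are left out here.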
A perceptive method for handwritten text segmentation
This paper presents a new method to address the problem of segmenting handwritten text into text lines and words. We propose a method based on the cooperation among points of view, which localizes the text lines in a low-resolution image and then associates the pixels at a higher level of resolution. Thanks to this combination of levels of vision, we can detect overlapping characters and re-segment the connected components during the analysis. We then propose a segmentation of lines into words based on the cooperation between digital data and symbolic knowledge. The digital data are obtained from distances inside a Delaunay graph, which gives a precise distance between connected components at the pixel level. We introduce structural rules in order to take into account some generic knowledge about the organization of a text page. This cooperation among sources of information gives greater expressive power and ensures the global coherence of the recognition. We validate this work using the metrics and the database proposed for the segmentation contest of ICDAR 2009, and show that our method obtains very interesting results compared to the other methods in the literature. More precisely, we are able to deal with slope and curvature, overlapping text lines and varied kinds of writing, which are the main difficulties met by the other methods.
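The step of grouping connected components into words by their mutual distances can be sketched in a much simplified form. The sketch below swaps the Delaunay-graph distance described in the abstract for a plain horizontal gap between bounding boxes, and replaces the structural rules with a single threshold; the function name `split_into_words` and the `factor` parameter are hypothetical.

```python
from statistics import median


def split_into_words(boxes, factor=2.0):
    """Group the components of one text line into words.

    boxes  -- list of (x_min, x_max) horizontal extents, one per component
    factor -- assumed multiplier: a gap wider than factor * median gap
              is treated as a word boundary
    """
    if not boxes:
        return []
    ordered = sorted(boxes)
    # Horizontal gap between consecutive components (0 if they overlap).
    gaps = [max(0, right[0] - left[1]) for left, right in zip(ordered, ordered[1:])]
    if not gaps:
        return [ordered]
    threshold = factor * median(gaps)
    words = [[ordered[0]]]
    for gap, box in zip(gaps, ordered[1:]):
        if gap > threshold:
            words.append([box])      # wide gap: start a new word
        else:
            words[-1].append(box)    # narrow gap: same word
    return words


line_boxes = [(0, 10), (12, 20), (22, 30), (45, 55), (57, 65)]
words = split_into_words(line_boxes)
```

A median-based threshold is only one possible choice, and it degrades when a line holds very few components or when most gaps are zero. The symbolic rules mentioned in the abstract are presumably there to handle such cases.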
Detection of Text Lines of Handwritten Arabic Manuscripts using Markov Decision Processes
In character recognition systems, the segmentation phase is critical since the accuracy of the recognition depends strongly on it. In this paper we present an approach based on Markov Decision Processes to extract text lines from binary images of Arabic handwritten documents. The proposed approach detects the connected components belonging to the same line by making use of knowledge about the features and arrangement of those components. The initial results show that the system is promising for extracting Arabic handwritten lines.
Content Recognition and Context Modeling for Document Analysis and Retrieval
The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge.
In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting.
Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification.
Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features.
Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.
An Integrated architecture for recognition of totally unconstrained handwritten numerals
Reprint. Reprinted from the International journal of pattern recognition and artificial intelligence. Vol. 7, no. 4 (1993). "January 1993." Includes bibliographical references (p. 127-128). Supported by the Productivity From Information Technology (PROFIT) Research Initiative at MIT. Amar Gupta ... [et al.]