A Statistically Based, Highly Accurate Text-line Segmentation Method

Authors: Ihsin T. Phillips, Jisheng Liang, Robert M. Haralick
Publication date: 01/01/1999

This paper describes a probability-based technique for identifying and segmenting text lines. All probabilities are estimated from an extensive training set of various kinds of distance measurements: between the terminal and non-terminal entities, and between the text-line and text-block entities with which the algorithm works. The probabilities estimated off-line during training then drive all decisions in the on-line segmentation algorithm. On the UW-III database of some 1600 scanned document image pages, containing some 105,020 text lines, the algorithm identifies and segments 104,773 correctly, an accuracy of 99.76%.

1. Introduction

Given a document image, a document segmentation algorithm in general produces a hierarchical structure that captures the physical layout and the logical meaning of the input document page. The top of the hierarchical structure represents the entire page, and the bottom of the structure includes all glyphs on the documen..
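
The accuracy figure reported in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration of that calculation, not part of the paper's method; the function name `segmentation_accuracy` is a hypothetical helper introduced here, and the two counts are the ones quoted in the abstract.

```python
def segmentation_accuracy(correct: int, total: int) -> float:
    """Return the percentage of text lines segmented correctly.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Counts quoted in the abstract for the UW-III database.
total_lines = 105_020
correct_lines = 104_773

accuracy = segmentation_accuracy(correct_lines, total_lines)
missed = total_lines - correct_lines

print(f"{accuracy:.2f}% correct, {missed} lines missed")
```

Under these counts, 247 of the 105,020 text lines were not segmented correctly, which is consistent with the 99.76% figure.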
