Turkish handwritten text recognition: a case of agglutinative languages
We describe a system for recognizing unconstrained Turkish handwritten text. Turkish has agglutinative morphology, so a theoretically infinite number of words can be generated by adding further suffixes to a word. This makes lexicon-based recognition approaches, where the most likely word is selected among all the alternatives in a lexicon, unsuitable for Turkish. We describe our approach to the problem using a Turkish prefix recognizer. First results of the system demonstrate the promise of this approach, with a top-10 word recognition rate of about 40% on a small test set of mixed handprint and cursive writing. The lexicon-based approach with a 17,000-word lexicon (with test words added) achieves a 56% top-10 word recognition rate.
Text Line Segmentation of Historical Documents: a Survey
There are huge numbers of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines), automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.
Comment: 25 pages, submitted version. To appear in International Journal on
Document Analysis and Recognition. Online version available at
http://www.springerlink.com/content/k2813176280456k3
A Bottom Up Procedure for Text Line Segmentation of Latin Script
In this paper we present a bottom-up procedure for segmentation of text lines
written or printed in the Latin script. The proposed method uses a combination
of image morphology, feature extraction and a Gaussian mixture model to perform
this task. The experimental results show the validity of the procedure.
Comment: Accepted and presented at the IEEE conference "International
Conference on Advances in Computing, Communications and Informatics (ICACCI)
2017"
Handwritten and Printed Text Separation in Real Document
The aim of the paper is to separate handwritten and printed text in real
documents that contain noise and graphics, including annotations. Relying on
the run-length smoothing algorithm (RLSA), the extracted pseudo-lines and
pseudo-words are used as basic blocks for classification. To handle this, a
multi-class support vector machine (SVM) with a Gaussian kernel performs a first
labelling of each pseudo-word, taking its local neighbourhood into account. It
then propagates the context between neighbours so that possible labelling
errors can be corrected. To address running-time complexity, we propose
linear-complexity methods that use k-NN with constraints. When using a kd-tree,
the running time is almost linearly proportional to the number of pseudo-words.
The performance of our system is close to 90%, even with a very small learning
dataset whose samples are mostly complex administrative documents.
Comment: Machine Vision Applications (2013)
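The abstract above relies on RLSA to form pseudo-words but does not spell the step out. As a rough illustration only, the following is a minimal sketch of horizontal run-length smoothing on a binary image, assuming 1 marks ink and 0 marks background. The function names `rlsa_row` and `rlsa_horizontal`, the `max_gap` threshold, and the choice to fill only gaps bounded by ink on both sides are assumptions made for this example, not details taken from the paper.

```python
def rlsa_row(row, max_gap):
    """Smooth one binary row: fill short background runs between ink pixels.

    Assumption for this sketch: 1 = ink, 0 = background, and only gaps that
    have ink on both sides are filled (leading/trailing runs stay untouched).
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None:
                gap = i - last_ink - 1  # number of background pixels between
                if 0 < gap <= max_gap:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply the row smoothing to every row of a binary image (list of lists)."""
    return [rlsa_row(row, max_gap) for row in image]


if __name__ == "__main__":
    # The gap of 2 is merged into one block; the gap of 4 is left open.
    print(rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], max_gap=2))
```

With a small threshold, neighbouring characters tend to merge into word-like blocks; a larger threshold tends to merge words into line-like blocks, which is one plausible way to obtain the pseudo-words and pseudo-lines the abstract mentions.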