Partitioning of the Degradation Space for OCR Training
Generally speaking, optical character recognition algorithms tend to perform better when presented with homogeneous data. This paper studies a method designed to increase the homogeneity of training data, based on an understanding of the types of degradations that occur during the printing and scanning process and of how these degradations affect the homogeneity of the data. While it has been shown that dividing the degradation space by edge spread improves recognition accuracy over dividing it by threshold or point spread function width alone, the challenge lies in deciding how many partitions to use and at what values of edge spread the divisions should be made. Clustering of different types of character features, fonts, sizes, resolutions and noise levels shows that edge spread is indeed a strong indicator of the homogeneity of character data clusters.
Text Degradations and OCR Training
Printing and scanning of text documents introduce degradations to the characters, and these degradations can be modeled. Interestingly, certain combinations of the parameters that govern the degradations introduced by the printing and scanning process affect characters in such a way that the degraded characters have a similar appearance, while other combinations leave the characters with a very different appearance. It is well known that, generally speaking, a test set that closely matches the training set will be recognized with higher accuracy than one that matches it less well. Likewise, classifiers tend to perform better on data sets that have lower variance. This paper explores an analytical method that uses a formal printer/scanner degradation model to identify the similarity between groups of degraded characters. This similarity is shown to improve the recognition accuracy of a classifier through model-directed choice of training set data.
Human Image Preference and Document Degradation Models
Because most degraded documents are created by people, the preferences individuals have in relation to degraded documents are quite important. Their preferences may determine whether or not the documents they create are appropriate for machines. The goal of this study was to find relationships between preference and several parameters of a scanner degradation model. It was found that the difference in binarization threshold and the difference in edge displacement caused by the degradation both had strong linear relationships to preference. The width of the point spread function did not show such a relationship. These relationships were counterintuitive because degraded characters with thicker stroke widths than the original were preferred to those that had stroke widths closer to the original character.
WordSup: Exploiting Word Annotations for Character based Text Detection
Imagery texts are usually organized as a hierarchy of several visual
elements, i.e. characters, words, text lines and text blocks. Among these
elements, the character is the most basic one for various languages such as
Western, Chinese, Japanese, and mathematical expressions. It is natural and
convenient to construct a common text detection engine based on character
detectors. However, training character detectors requires a vast number of
location-annotated characters, which are expensive to obtain. Actually, the existing
real text datasets are mostly annotated in word or line level. To remedy this
dilemma, we propose a weakly supervised framework that can utilize word
annotations, either in tight quadrangles or the more loose bounding boxes, for
character detector training. When applied in scene text detection, we are thus
able to train a robust character detector by exploiting word annotations in the
rich large-scale real scene text datasets, e.g. ICDAR15 and COCO-text. The
character detector plays a key role in the pipeline of our text detection
engine. It achieves the state-of-the-art performance on several challenging
scene text detection benchmarks. We also demonstrate the flexibility of our
pipeline in various scenarios, including deformed text detection and math
expression recognition.
Comment: 2017 International Conference on Computer Visio
Predictive maintenance: a novel framework for a data-driven, semi-supervised, and partially online prognostic health management application in industries
Prognostic Health Management (PHM) is a predictive maintenance strategy that is based on Condition Monitoring (CM) data and aims to predict the future states of machinery. The existing literature addresses PHM at two levels: methodological and applicative. From the methodological point of view, there are many publications and standards on PHM system design. From the applicative point of view, many papers address the improvement of techniques adopted for realizing PHM tasks without covering the whole process. In these cases, most applications rely on a large amount of historical data to train models for diagnostic and prognostic purposes. Industries are very often unable to obtain these data, so the most widely adopted approaches, based on batch and off-line analysis, cannot be applied. In this paper, we present a novel framework and architecture that support the initial application of PHM from the machinery producers’ perspective. The proposed framework is based on an edge-cloud infrastructure that allows performing streaming analysis at the edge to reduce the quantity of data to store in permanent memory, to know the health status of the machinery at any point in time, and to discover novel and anomalous behaviors. Collecting the data from multiple machines into a cloud server allows training more accurate diagnostic and prognostic models on a larger amount of data, whose results will serve to predict the health status in real time at the edge. The resulting PHM system would allow industries to monitor and supervise a machinery network placed in different locations and can thus bring several benefits to both machinery producers and users. After a brief literature review of signal processing, feature extraction, diagnostics, and prognostics, including incremental and semi-supervised approaches for anomaly and novelty detection applied to data streams, a case study is presented.
It was conducted on data collected from a test rig and shows the potential of the proposed framework in terms of its ability to detect changes in operating conditions and abrupt faults and to save storage memory. The main outcome of our work, and its major novel aspect, is the design of a framework for a PHM system based on specific requirements that originate directly from the industrial field, together with indications of which techniques can be adopted to achieve such goals.
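The streaming analysis at the edge described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes a single scalar condition-monitoring signal, keeps a running mean and variance with Welford's update so that raw history need not be stored, and flags samples whose z-score exceeds a limit. The class name `StreamingDetector` and the parameters `threshold` and `warmup` are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class StreamingDetector:
    """Illustrative z-score detector; names and defaults are assumptions."""

    threshold: float = 3.0  # z-score above which a sample is flagged
    warmup: int = 10        # samples to absorb before flagging starts
    n: int = 0              # number of samples absorbed so far
    mean: float = 0.0       # running mean of absorbed samples
    m2: float = 0.0         # running sum of squared deviations

    def update(self, x: float) -> bool:
        """Return True if x looks anomalous; absorb it only if it does not."""
        anomalous = False
        if self.n >= self.warmup:
            std = sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) > self.threshold * std
        if not anomalous:
            # Welford's incremental update: constant memory per signal.
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return anomalous


# Made-up readings: a steady signal followed by one abrupt jump.
readings = [1.0, 1.1, 0.9, 1.05, 0.95] * 3 + [5.0]
detector = StreamingDetector()
flags = [detector.update(r) for r in readings]
```

Because flagged samples are not absorbed, a lasting change in operating conditions would keep being flagged in this sketch; a real system would need a rule for accepting a new regime.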
A Multiple-Expert Binarization Framework for Multispectral Images
In this work, a multiple-expert binarization framework for multispectral
images is proposed. The framework is based on a constrained subspace selection
limited to the spectral bands combined with state-of-the-art gray-level
binarization methods. The framework uses a binarization wrapper to enhance the
performance of the gray-level binarization. Nonlinear preprocessing of the
individual spectral bands is used to enhance the textual information. An
evolutionary optimizer is considered to obtain the optimal and some suboptimal
3-band subspaces from which an ensemble of experts is then formed. The
framework is applied to a ground truth multispectral dataset with promising
results. In addition, a generalization of the cross-validation approach is
developed that not only evaluates the generalizability of the framework but also
provides a practical instance of the selected experts that could be then
applied to unseen inputs despite the small size of the given ground truth
dataset.
Comment: 12 pages, 8 figures, 6 tables. Presented at ICDAR'1
Degradation Specific OCR
Optical Character Recognition (OCR) is the mechanical or electronic translation of scanned images of handwritten, typewritten, or printed text into machine-encoded text. OCR has many applications, such as making a physical text document editable or enabling computer search of a text that was initially in printed form. OCR engines are widely used to digitize text documents so that they can be stored digitally for remote access, mainly through websites. This makes these invaluable resources instantly available, no matter the geographical location of the end user. Large OCR misclassification errors can occur when an OCR engine is used to digitize a degraded document. The degradation may have various causes, including aging of the paper, incompletely printed characters, and blots of ink on the original document. In this thesis, the degradation due to scanning text documents was considered. To improve OCR performance, it is vital to train the classifier on a large training set that contains a significant number of data points similar to the degraded real-life characters. In this thesis, characters with varying degrees of blurring and varying binarization thresholds were generated and used to calculate Edge Spread degradation parameters. These parameters were then used to divide the training data set of the OCR engine into more homogeneous sets. The classification accuracy obtained by training on these smaller sets was analyzed.
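As a rough illustration of how blur and threshold combine into a single edge-spread value, the sketch below computes the shift of an ideal isolated step edge. It rests on assumptions the abstract does not state: a Gaussian point spread function whose standard deviation is the blur width, and a threshold in (0, 1) on a 0 = white, 1 = black scale. The function name `edge_displacement` is hypothetical, and the thesis's DC and MDC estimators may be defined differently.

```python
from statistics import NormalDist


def edge_displacement(blur_width: float, threshold: float) -> float:
    """Shift of an ideal step edge after Gaussian blur and thresholding.

    Assumes a Gaussian point spread function with standard deviation
    `blur_width` (in pixels) and a threshold strictly between 0 and 1.
    Positive values mean the black stroke grows past the original edge.
    """
    # The blurred edge profile is a Gaussian CDF, so the binarized edge
    # sits where that profile crosses the threshold.
    return -blur_width * NormalDist().inv_cdf(threshold)


# Two different (blur width, threshold) pairs with the same displacement:
d1 = edge_displacement(1.0, 0.2)
matching_threshold = NormalDist().cdf(-d1 / 2.0)  # threshold for blur width 2.0
d2 = edge_displacement(2.0, matching_threshold)
```

Under these assumptions `d1` and `d2` agree, which is one way to see why characters degraded with quite different blur and threshold settings can end up with a similar appearance.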
The training data set consisted of 100,000 data points of the lowercase characters ‘c’ and ‘e’ in a 12-point sans-serif font at 300 DPI. These characters were generated with random values of threshold and blur width, with random Gaussian noise added. To group similarly degraded characters together, clustering was performed using the ISODATA clustering algorithm. Two edge-spread parameters were estimated to fit the cluster boundaries: DC, calculated on isolated edges, and MDC, calculated on edges in close proximity to account for interference effects. These values were then used to divide the training data, and a Bayesian classifier was used for recognition. It was verified that MDC is slightly better than DC as a division parameter. Dividing the data set into either 2 or 3 partitions was found to work best. An experimental way to estimate the best boundary for dividing the data set was determined, and tests were conducted that verified it.
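The partition-then-classify idea in this paragraph can be sketched as follows. This is a toy illustration, not the thesis implementation: it assumes one scalar feature per character, crisp assignment to partitions, equal class priors, and one Gaussian per (partition, class). The function names, the single boundary at 0.0, and the eight data points are all made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import NormalDist, fmean, pstdev


def partition_index(edge_spread, boundaries):
    """Index of the partition that an edge-spread value falls into."""
    return bisect_right(boundaries, edge_spread)


def train(samples, boundaries):
    """Fit one Gaussian per (partition, class) on a scalar feature.

    `samples` is an iterable of (feature, label, edge_spread) tuples.
    """
    grouped = defaultdict(list)
    for feature, label, edge_spread in samples:
        part = partition_index(edge_spread, boundaries)
        grouped[(part, label)].append(feature)
    models = defaultdict(dict)
    for (part, label), values in grouped.items():
        # Floor the spread so a degenerate group cannot give sigma = 0.
        sigma = max(pstdev(values), 1e-3)
        models[part][label] = NormalDist(fmean(values), sigma)
    return models


def classify(models, boundaries, feature, edge_spread):
    """Most likely class according to the matching partition's model only.

    Raises ValueError if that partition received no training samples.
    """
    part = partition_index(edge_spread, boundaries)
    candidates = models[part]
    return max(candidates, key=lambda label: candidates[label].pdf(feature))


# Made-up data: (feature, label, edge_spread). The same feature value
# belongs to different classes depending on the degradation.
samples = [
    (10.0, "c", -0.5), (11.0, "c", -0.4), (14.0, "e", -0.5), (15.0, "e", -0.3),
    (14.0, "c", 0.5), (15.0, "c", 0.6), (18.0, "e", 0.4), (19.0, "e", 0.7),
]
boundaries = [0.0]
models = train(samples, boundaries)
low = classify(models, boundaries, 14.2, -0.4)   # 'e'
high = classify(models, boundaries, 14.2, 0.5)   # 'c'
```

In this toy data a feature value near 14 means ‘e’ when the edge spread is negative and ‘c’ when it is positive, so a single classifier trained on all samples would confuse the two, while the per-partition models separate them.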
Both crisp and fuzzy approaches to classifier training and testing were implemented, and various combinations were tried. Crisp training with fuzzy testing was the best approach, giving a 98.08% classification rate for the data set divided into 2 partitions and a 98.93% classification rate for the data set divided into 3 partitions, compared with 94.08% for the data set with no divisions.
Character Recognition
Character recognition is one of the most widely used pattern recognition technologies in practical applications. This book presents recent advances that are relevant to character recognition, from technical topics such as image processing, feature extraction and classification to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.
- …