
    Information Preserving Processing of Noisy Handwritten Document Images

    Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image. Important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines impact people's handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean ruling-line counting error from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, compared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%.
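    The abstract does not spell out the multi-line regression it uses. As a rough illustration only, one common way to fit several ruling lines at once is to assume they are parallel and estimate a single shared slope with one intercept per line. The function name `fit_parallel_lines` and the input layout (points already labelled with a line index) are assumptions made for this sketch, not details from the work.

```python
from collections import defaultdict


def fit_parallel_lines(points):
    """Least-squares fit of y = slope * x + intercept[k] with one shared slope.

    points: iterable of (k, x, y), where k labels the ruling line a pixel
    belongs to (hypothetical input layout, assumed for this sketch).
    Returns (slope, {k: intercept}).
    """
    groups = defaultdict(list)
    for k, x, y in points:
        groups[k].append((x, y))

    # Per-line centroids; centering removes each line's own intercept.
    means = {}
    for k, pts in groups.items():
        n = len(pts)
        means[k] = (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

    # Pool the centered cross-products over all lines to get one slope.
    sxy = 0.0
    sxx = 0.0
    for k, pts in groups.items():
        mx, my = means[k]
        for x, y in pts:
            sxy += (x - mx) * (y - my)
            sxx += (x - mx) ** 2
    if sxx == 0.0:
        raise ValueError("need at least two distinct x values on some line")

    slope = sxy / sxx
    intercepts = {k: my - slope * mx for k, (mx, my) in means.items()}
    return slope, intercepts
```

    Pooling the slope lets short or partly occluded lines borrow evidence from the longer ones, which is one plausible reason a joint fit could be more robust than fitting each line separately.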

    Handling Massive N-Gram Datasets Efficiently

    This paper deals with the two fundamental problems concerning the handling of large n-gram language models: indexing, that is, compressing the n-gram strings and associated satellite data without compromising their retrieval speed; and estimation, that is, computing the probability distribution of the strings from a large textual source. Regarding the problem of indexing, we describe compressed, exact and lossless data structures that achieve, at the same time, high space reductions and no time degradation with respect to state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word following a context of fixed length k, i.e., its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. 
With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5X on the total running time of the state-of-the-art approach. Comment: Published in ACM Transactions on Information Systems (TOIS), February 2019, Article No: 2
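    The context-based encoding described in this abstract can be illustrated with a small sketch. It is one reading of the idea, not the paper's data structure: each word is replaced by its rank among the distinct words seen after the same context, so the stored integer is bounded by the number of successors of that context instead of the vocabulary size. The function names and the use of plain dictionaries are assumptions made for illustration.

```python
from collections import defaultdict


def build_successor_ranks(ngrams):
    """Map each context (tuple of preceding words) to {next_word: rank}.

    The rank is the word's position among the sorted distinct successors
    of that context, so it is always smaller than the successor count.
    """
    successors = defaultdict(set)
    for *context, word in ngrams:
        successors[tuple(context)].add(word)
    return {
        ctx: {w: i for i, w in enumerate(sorted(words))}
        for ctx, words in successors.items()
    }


def encode(ranks, ngram):
    """Return the small integer that stands in for the last word of ngram."""
    *context, word = ngram
    return ranks[tuple(context)][word]
```

    For example, if only two distinct words ever follow a given context, both are stored as 0 or 1 however large the vocabulary is. Small integers of this kind are what make a compact integer code effective.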

    Automatic handwriter identification using advanced machine learning

    Handwriter identification is a challenging problem, especially for forensic investigation. This topic has received significant attention from the research community, and several handwriter identification systems have been developed for various applications including forensic science, document analysis and investigation of historical documents. This work is part of an investigation to develop new tools and methods for Arabic palaeography, which is the study of handwritten material, particularly ancient manuscripts with missing writers, dates, and/or places. In particular, the main aim of this research project is to investigate and develop new techniques and algorithms for the classification and analysis of ancient handwritten documents to support palaeographic studies. Three contributions are proposed in this research. The first is concerned with the development of a text line extraction algorithm for colour and greyscale historical manuscripts. The idea uses a modified bilateral filtering approach to adaptively smooth the images while still preserving the edges through a nonlinear combination of neighbouring image values. The proposed algorithm aims to compute a median and a separating seam and has been validated on both greyscale and colour historical documents using different datasets. The results obtained suggest that the proposed technique yields attractive results when compared against a few similar algorithms. The second contribution proposes to deploy a combination of Oriented Basic Image features and the concept of a grapheme codebook in order to improve recognition performance. The proposed algorithm is capable of effectively extracting the most distinguishing handwriter’s patterns. The idea consists of judiciously combining a multiscale feature extraction with the concept of graphemes to allow for the extraction of several discriminating features such as handwriting curvature, direction, wrinkliness and various edge-based features. 
The technique was validated for identifying handwriters using both Arabic and English writings captured as scanned images, using the IAM dataset for English handwriting and the ICFHR 2012 dataset for Arabic handwriting. The results obtained clearly demonstrate the effectiveness of the proposed method when compared against some similar techniques. The third contribution is concerned with an offline handwriter identification approach based on convolutional neural network (CNN) technology. In the first stage, the AlexNet architecture is employed to learn image features from handwritten scripts, and features are obtained from the fully connected layers of the model. Then, a support vector machine classifier is deployed to classify the writing styles of the various handwriters. In this way, features of the test scripts produced by the trained CNN are passed on for further classification. The proposed approach was evaluated on two Arabic historical datasets: Islamic Heritage Project (IHP) and Qatar National Library (QNL). The results obtained demonstrate that the proposed model achieves superior performance when compared to some similar methods.

    Advances in Character Recognition

    This book presents advances in character recognition. It consists of 12 chapters that cover a wide range of topics on different aspects of character recognition. Hopefully, this book will serve as a reference source for academic research, for professionals working in the character recognition field and for all interested in the subject.

    OpenLB User Guide: Associated with Release 1.6 of the Code

    OpenLB is an object-oriented implementation of lattice Boltzmann methods (LBM). It is the first implementation of a generic platform for LBM programming, which is shared with the open source community (GPLv2). Since the first release in 2007, the code has been continuously improved and extended, which is documented by thirteen releases as well as the corresponding release notes available on the OpenLB website (https://www.openlb.net). The OpenLB code is written in C++ and is used by application programmers as well as developers, with the ability to implement custom models. OpenLB supports complex data structures that allow simulations in complex geometries and parallel execution using MPI, OpenMP and CUDA on high-performance computers. The source code uses the concepts of interfaces and templates, so that efficient, direct and intuitive implementations of the LBM become possible. The efficiency and scalability have been checked and proved by code reviews. This user manual and source code documentation generated by Doxygen are available on the OpenLB project website.

    Bayesian hierarchical modeling for the forensic evaluation of handwritten documents

    The analysis of handwritten evidence has been used widely in courts in the United States since the 1930s (Osborn, 1946). Traditional evaluations are conducted by trained forensic examiners. More recently, there has been a movement toward objective and probability-based evaluation of evidence, and a variety of governing bodies have made explicit calls for research to support the scientific underpinnings of the field (National Research Council, 2009; President's Council of Advisors on Science and Technology (US), 2016; National Institute of Standards and Technology). This body of work makes contributions to help satisfy those needs for the evaluation of handwritten documents. We develop a framework to evaluate a questioned writing sample against a finite set of genuine writing samples from known sources. Our approach is fully automated, reducing the opportunity for cognitive biases to enter the analysis pipeline through regular examiner intervention. Our methods are able to handle all writing styles together, and result in estimated probabilities of writership based on parametric modeling. We contribute open-source datasets, code, and algorithms. A document is prepared for the evaluation process by first being scanned and stored as an image file. The image is processed and the text within is decomposed into a sequence of disjoint graphical structures. The graphs serve as the smallest unit of writing we will consider, and features extracted from them are used as data for modeling. Chapter 2 describes the image processing steps and introduces a distance measure for the graphs. The distance measure is used in a K-means clustering algorithm (Forgy, 1965; Lloyd, 1982; Gan and Ng, 2017), which results in a clustering template with 40 exemplar structures. The primary feature we extract from each graph is a cluster assignment. 
We do so by comparing each graph to the template and making assignments based on the exemplar to which each graph is most similar in structure. The cluster assignment feature is used for a writer identification exercise using a Bayesian hierarchical model on a small set of 27 writers. In Chapter 3 we incorporate new data sources and a larger number of writers in the clustering algorithm to produce an updated template. A mixture component is added to the hierarchical model and we explore the relationship between a writer's estimated mixing parameter and their writing style. In Chapter 4 we expand the hierarchical model to include other graph-based features, in addition to cluster assignments. We incorporate an angular feature with support on the polar coordinate system into the hierarchical modeling framework using a circular probability density function. The new model is applied and tested in three applications.
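    The cluster-assignment feature described in this abstract amounts to a nearest-exemplar lookup followed by per-writer counting. The sketch below shows only that step. The thesis's graph distance measure is not given here, so the `distance` argument is a placeholder, and all function names are assumptions made for illustration.

```python
from collections import Counter


def assign_clusters(items, exemplars, distance):
    """Return, for each item, the index of its closest exemplar.

    distance is any callable d(item, exemplar) -> float. The graph
    distance from the thesis is not reproduced here (assumption).
    """
    return [
        min(range(len(exemplars)), key=lambda j: distance(item, exemplars[j]))
        for item in items
    ]


def cluster_counts(assignments, n_clusters):
    """Count how often each cluster index occurs in one writing sample."""
    counts = Counter(assignments)
    return [counts.get(j, 0) for j in range(n_clusters)]
```

    With a template of 40 exemplars, `cluster_counts(assignments, 40)` would give a length-40 count vector per document. Counts of that kind are a natural input to a hierarchical model over writers, although the exact likelihood used in the thesis is not stated in the abstract.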

    PC based storage and processing of electrocardiogram tracings recorded with a HP4745A pagewriter II cardiograph

    Thesis. Currently the Department of Cardiology, Universitas Hospital, keeps paper copies of ECGs filed in large filing cabinets. Access to these files is tedious during office hours, and impossible after hours, when the filing room is locked and no filing personnel are available. Systems for computerised storage of ECG data are commercially available from a number of vendors. Some drawbacks of these systems include: • They are extremely expensive. • Only a portion of the functions offered by these systems are really needed at the Department of Cardiology, Universitas Hospital, so these systems are not economically justifiable for the department. • Some require new or different ECG machines to be used. • Some require an expensive computer system to be installed. • Additional space is needed for additional equipment. • Staff need to be extensively trained to use the new equipment. This dissertation describes the development of a dynamic link library (DLL) which is used to acquire and decode data from a Hewlett Packard HP4745A Cardiograph II Page Writer electrocardiograph. Furthermore, the database application using the HP4745A DLL can also be expanded to accept data from other ECG machines. Any such acquisition and decoding DLL must produce a decoded data file conforming to the format described in this dissertation. By storing these decoded data in a database such as Hearts 32, the data can be reprocessed (drawing of ECG traces on screen or on printer). Selected leads from different ECGs can also be plotted on the same screen. Fast access to previous ECGs will help the cardiologists at the Universitas Hospital in Bloemfontein to improve patient care. The cardiac patients of the Free State community as well as the staff at the Department of Cardiology, Universitas Hospital, Bloemfontein can benefit from the results of this research.

    Query-Driven Global Graph Attention Model for Visual Parsing: Recognizing Handwritten and Typeset Math Formulas

    We present a new visual parsing method based on standard Convolutional Neural Networks (CNNs) for handwritten and typeset mathematical formulas. The Query-Driven Global Graph Attention (QD-GGA) parser employs multi-task learning, using a single feature representation for locating, classifying, and relating symbols. QD-GGA parses formulas by first constructing a Line-Of-Sight (LOS) graph over the input primitives (e.g., handwritten strokes or connected components in images). Second, class distributions for LOS nodes and edges are obtained using query-specific feature filters (i.e., attention) in a single feed-forward pass. This allows end-to-end structure learning using a joint loss over primitive node and edge class distributions. Finally, a Maximum Spanning Tree (MST) is extracted from the weighted graph using Edmonds' Arborescence Algorithm. The model may be run recurrently over the input graph, updating attention to focus on symbols detected in the previous iteration. QD-GGA does not require additional grammar rules, and the language model is learned from the sets of symbols/relationships and the statistics over them in the training set. We benchmark our system against both handwritten and typeset state-of-the-art math recognition systems. Our preliminary results show that this is a promising new approach for visual parsing of math formulas. Using recurrent execution, symbol detection is near perfect for both handwritten and typeset formulas: we obtain a symbol f-measure of over 99.4% for both the CROHME (handwritten) and INFTYMCCDB-2 (typeset formula image) datasets. Our method is also much faster in both training and execution than state-of-the-art RNN-based formula parsers. The unlabeled structure detection of QD-GGA is competitive with encoder-decoder models, but QD-GGA symbol and relationship classification is weaker. We believe this may be addressed through increased use of spatial features and global context.