44 research outputs found

    Automatic detection of change in address blocks for reply forms processing

    In this paper, an automatic method to detect the presence of on-line erasures, scribbles, corrections and over-writing in the address block of various types of subscription and utility payment forms is presented. The proposed approach employs bottom-up segmentation of the address block. Heuristic rules based on structural features are used to automate the detection process. The algorithm is applied to a large dataset of 5,780 real-world document forms at a resolution of 200 dots per inch. The proposed algorithm performs well, with an average processing time of 108 milliseconds per document and a detection accuracy of 98.96%.

    Maintenance of rock shed slope-retaining structures: a landslide mitigation measure at Simpang Pulai - Blue Valley, Perak

    The construction industry is highly challenging, not only in Malaysia but throughout the world, and is characterised by the 3Ds: dirty, difficult and dangerous. The industry is also among the largest contributors to GDP, at 7.4 percent in 2016, even though it is among the largest contributors in terms of safety, namely accidents (CIDB, 2017). The responsible parties should therefore take the problems it faces seriously so that the industry can compete at the international level.

    A novel computer vision based method for PDF academic literature structure understanding

    The PDF format plays a crucial role in electronic academic publishing, but because of its complicated technical rules, PDF cannot be read directly by machines, which greatly inconveniences research on academic literature. This poster proposes a computer vision-based method for understanding PDF document structure. The method maps visual objects and text objects in PDF academic papers and obtains geometric and text attributes of content objects, supplemented by a heuristic algorithm. The algorithm classifies each content object by type to obtain the physical and logical structure of the PDF document. The method overcomes shortcomings of other PDF analysis methods, which require extensive manual feature construction or large-scale corpus training and have difficulty identifying formulas and tables, and it successfully performs structure understanding and full-text extraction on ACM's collections.

    Segmentation of Unstructured Newspaper Documents

    Document layout analysis is an important step in automated document recognition systems. It retrieves meaningful information from document images by identifying, categorizing and labeling the semantics of their text blocks. In this paper, we present a simple top-down approach to document page segmentation. We have tested the proposed method on unstructured documents such as newspapers, which have complex layouts with no fixed structure as well as multiple titles and multiple columns. The method identifies the white gaps that separate titles, columns of text, lines of text and words within lines, and uses them to split the document into segments. The algorithm has been implemented and applied to a large number of Indian newspapers, and the results have been evaluated by the number of blocks detected, taking their correct ordering into account.
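
    The white-gap idea in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' algorithm: the page is assumed to be a small binary matrix (1 = ink, 0 = white), and the names `find_runs`, `segment`, `min_row_gap` and `min_col_gap` are hypothetical. The sketch cuts the page into horizontal bands at empty rows, then cuts each band into blocks at empty columns, and returns the blocks in top-to-bottom, left-to-right order.

```python
from typing import List, Tuple


def find_runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) pairs (end exclusive) of content bands that are
    separated by at least `min_gap` consecutive empty profile entries."""
    bands = []
    start = None  # index where the current content band began
    gap = 0       # length of the current run of empty entries
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The band ended just before this run of white space.
                bands.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a band that runs to the end, trimming trailing white space.
        bands.append((start, len(profile) - gap))
    return bands


def segment(page: List[List[int]], min_row_gap: int = 1,
            min_col_gap: int = 1) -> List[Tuple[int, int, int, int]]:
    """Top-down split: horizontal white gaps first, then vertical ones.
    Returns (top, bottom, left, right) boxes in reading order."""
    row_profile = [sum(row) for row in page]
    blocks = []
    for top, bottom in find_runs(row_profile, min_row_gap):
        band = page[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in find_runs(col_profile, min_col_gap):
            blocks.append((top, bottom, left, right))
    return blocks


# Hypothetical toy page: a full-width title above two text columns.
page = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
]
blocks = segment(page)
```

    On the toy page, the title row becomes one block and the two columns below it become two further blocks. A real newspaper page would need noise handling and recursion down to lines and words, which this sketch leaves out.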

    Entropy Computation of Document Images in Run-Length Compressed Domain

    Compression of documents, images, audio and video has traditionally been practiced to increase the efficiency of data storage and transfer. However, decompression has become an unavoidable prerequisite for any processing or analytical computation. In this research work, we attempt to compute entropy, an important document analytic, directly from the compressed documents. We use the Conventional Entropy Quantifier (CEQ) and Spatial Entropy Quantifiers (SEQ) for entropy computations [1]. The entropies obtained are useful in applications such as establishing equivalence, word spotting and document retrieval. Experiments have been performed with all the data sets of [1] at character, word and line levels, taking compressed documents in the run-length compressed domain. The algorithms developed are computationally and space efficient, and the results obtained match 100% with the results reported in [1]. Comment: Published in IEEE Proceedings 2014 Fifth International Conference on Signals and Image Processin
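
    To illustrate what computing a statistic directly from run-length data can look like, here is a minimal sketch. It does not implement CEQ or SEQ, whose definitions are given in [1] and not in this abstract; it computes the plain Shannon entropy of the black/white pixel proportions instead. The input format (each row as alternating run lengths, starting with a white run that may be zero) and the name `pixel_entropy_from_runs` are assumptions made for the example.

```python
import math
from typing import List


def pixel_entropy_from_runs(rows: List[List[int]]) -> float:
    """Shannon entropy (in bits) of the black/white pixel distribution,
    computed from run lengths without rebuilding the bitmap.

    Assumed format: each row lists alternating run lengths, beginning
    with a white run (0 if the row starts with a black pixel)."""
    white = sum(sum(row[0::2]) for row in rows)  # even positions: white runs
    black = sum(sum(row[1::2]) for row in rows)  # odd positions: black runs
    total = white + black
    entropy = 0.0
    for count in (white, black):
        if count:  # skip empty classes, treating 0 * log(0) as 0
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical 3 x 4 image: rows "WWBB", "BBBB", "WWWW".
balanced = pixel_entropy_from_runs([[2, 2], [0, 4], [4]])
# An all-white image carries no uncertainty.
blank = pixel_entropy_from_runs([[4], [4]])
```

    The pixel counts come from summing run lengths, so the work is proportional to the number of runs, not the number of pixels.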

    Predicting semantic labels of text regions in heterogeneous document images

    Contains fulltext: 214639.pdf (publisher's version) (Open Access). KONVENS 2019: 15th Conference on Natural Language Processing, Erlangen, Germany, October 9-11, 201

    Document analysis of PDF files: methods, results and implications

    A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file structure for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships existing between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues in bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.

    Identification of Technical Journals by Image Processing Techniques

    This study focuses on developing an automatic approach to identifying an unknown technical journal from its cover page. Since journal cover pages contain a great deal of information, determining the title of an unknown journal using optical character recognition (OCR) techniques seems difficult. Comparing the layout structures of text blocks on the journal cover pages is an effective method for distinguishing one journal from another. To make layout-structure comparison efficient, a left-to-right hidden Markov model (HMM) is used to represent the layout structure of text blocks for each kind of journal. The title of an unknown input journal can then be determined by comparing its layout structure to each HMM in the database. In addition, from the layout structure of the best-matched HMM, we can locate the text block of the issue date, which is then recognized by OCR techniques to support an automatic journal registration system. Experimental results show the feasibility of the proposed approach.
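
    The matching step described in this abstract can be sketched as scoring an observation sequence against one left-to-right HMM per journal and picking the best-scoring model. The sketch below uses the standard forward algorithm with discrete observations. The encoding of text blocks as symbols (0 = wide block, 1 = narrow block), the two toy models and the names `forward_likelihood` and `identify` are hypothetical, since the abstract does not specify the features or model parameters.

```python
from typing import Dict, List, Tuple

# A model is (start probabilities, transition matrix, emission matrix).
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def forward_likelihood(obs: List[int], start: List[float],
                       trans: List[List[float]],
                       emit: List[List[float]]) -> float:
    """Probability of the observation sequence under a discrete HMM,
    computed with the forward algorithm."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


def identify(obs: List[int], models: Dict[str, Model]) -> str:
    """Return the name of the model that gives the sequence the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Left-to-right structure: start in state 0, never move back to an earlier state.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    # Hypothetical journal whose cover has wide blocks first, then narrow ones.
    "journal_a": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    # Hypothetical journal with the opposite arrangement.
    "journal_b": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}

best = identify([0, 0, 1, 1], models)
```

    For long sequences the probabilities would underflow, so a practical version would work with log probabilities or scaled forward variables.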

    Simple Character Recognition

    This work deals with locating and recognizing text in an image document. It discusses feature extraction and its use in machine learning. Part of the work is devoted to the design and implementation of an application for simple character recognition of machine-printed text.