44 research outputs found
Automatic detection of change in address blocks for reply forms processing
In this paper, an automatic method is presented for detecting on-line erasures, scribbles, corrections and over-writing in the address block of various types of subscription and utility payment forms. The proposed approach employs bottom-up segmentation of the address block. Heuristic rules based on structural features are used to automate the detection process. The algorithm is applied to a large dataset of 5,780 real-world document forms scanned at 200 dots per inch. The proposed algorithm performs well, with an average processing time of 108 milliseconds per document and a detection accuracy of 98.96%.
Maintenance of the rock shed slope-retaining structure: a landslide mitigation measure at Simpang Pulai - Blue Valley, Perak
The construction industry is a highly challenging industry, not only in Malaysia but worldwide, encompassing the "3D" scope of dirty, difficult and dangerous work. The industry is also among the largest contributors to GDP, at 7.4 percent in 2016, even though it is among the largest contributors of safety incidents, namely accidents (CIDB, 2017). Therefore, the responsible parties should take the problems faced seriously so that the industry can compete at the international level.
A novel computer vision based method for PDF academic literature structure understanding
The PDF format plays a crucial role in electronic academic publishing, but due to its complicated technical rules, PDF cannot be directly read by machines, which causes considerable inconvenience for research work on academic literature. This poster proposes a computer-vision-based PDF document structure understanding method. The method maps visual objects to text objects in PDF academic papers and obtains the geometric and textual attributes of content objects, supplemented by a heuristic algorithm. The algorithm classifies content objects by type to obtain the physical and logical structure of the PDF document. This method overcomes the shortcomings of other PDF analysis approaches, which require extensive hand-crafted feature construction or large-scale corpus training and have difficulty identifying formulas and tables, and successfully performs structure understanding and full-text extraction over ACM's collections.
Segmentation of Unstructured Newspaper Documents
Document layout analysis is one of the important steps in automated document recognition systems. In document layout analysis, meaningful information is retrieved from document images by identifying, categorizing and labeling the semantics of their text blocks. In this paper, we present a simple top-down approach for document page segmentation. We have tested the proposed method on unstructured documents such as newspapers, which have complex layouts with no fixed structure, as well as multiple titles and multiple columns. In the proposed method, the white gap areas that separate titles, columns of text, lines of text and words within lines are identified to divide the document into segments. The proposed algorithm has been implemented and applied to a large number of Indian newspapers, and the results have been evaluated by the number of blocks detected, taking their correct reading order into account.
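The white-gap idea described in this abstract can be sketched as a horizontal projection-profile cut — a minimal illustration under assumed inputs (a binarized page as a 2-D list and a `min_gap` threshold), not the paper's exact algorithm:

```python
def segment_by_white_gaps(rows, min_gap=5):
    """rows: 2-D binary image as a list of lists (1 = ink, 0 = background).
    Returns (start, end) row intervals of ink bands separated by at least
    min_gap blank rows, found via a horizontal projection profile."""
    profile = [sum(r) for r in rows]   # ink-pixel count per row
    segments, start, last_ink = [], None, None
    for i, ink in enumerate(profile):
        if ink > 0:
            if start is None:          # a new ink band begins
                start = i
            last_ink = i
        elif start is not None and i - last_ink >= min_gap:
            segments.append((start, last_ink + 1))  # gap wide enough: close the band
            start = None
    if start is not None:
        segments.append((start, last_ink + 1))      # close a band that runs to the end
    return segments
```

Applying the same cut along columns (transpose the image) would separate newspaper columns; recursing on the resulting bands yields the top-down segmentation the abstract describes.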
Entropy Computation of Document Images in Run-Length Compressed Domain
Compression of documents, images, audio and video has traditionally been practiced to increase the efficiency of data storage and transfer. However, in order to process or carry out any analytical computation, decompression has become an unavoidable prerequisite. In this research work, we attempt to compute entropy, an important document analytic, directly from compressed documents. We use the Conventional Entropy Quantifier (CEQ) and Spatial Entropy Quantifier (SEQ) for the entropy computations [1]. The entropies obtained are useful in applications such as establishing equivalence, word spotting and document retrieval. Experiments have been performed with all the datasets of [1], at character, word and line levels, taking compressed documents in the run-length compressed domain. The algorithms developed are computation- and space-efficient, and the results obtained match 100% with the results reported in [1]. Comment: Published in IEEE Proceedings, 2014 Fifth International Conference on Signals and Image Processing
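The core idea — computing an analytic without ever reconstructing the bitmap — can be sketched by taking Shannon entropy over the run-length symbols themselves. This is an illustration only; the paper's CEQ and SEQ quantifiers follow the formulations of its reference [1], which this sketch does not reproduce:

```python
import math
from collections import Counter

def entropy_from_runs(runs):
    """Shannon entropy (bits) of a run-length symbol stream, computed
    directly from the compressed representation -- the pixel bitmap is
    never decompressed."""
    counts = Counter(runs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A run-length encoded binary scanline: alternating white/black run lengths.
runs = [7, 3, 7, 3, 2, 8]
```

Because the statistic is gathered from run lengths rather than pixels, both time and memory stay proportional to the compressed size, which is the efficiency claim the abstract makes.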
Predicting semantic labels of text regions in heterogeneous document images
KONVENS 2019: 15th Conference on Natural Language Processing, Erlangen, Germany, October 9-11, 2019 (Open Access)
Document analysis of PDF files: methods, results and implications
A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file format for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues for bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
Identification of Technical Journals by Image Processing Techniques
The emphasis of this study is on developing an automatic approach to identifying a given unknown technical journal from its cover page. Since journal cover pages contain a great deal of information, determining the title of an unknown journal using optical character recognition techniques alone is difficult. Comparing the layout structures of text blocks on journal cover pages is an effective method for distinguishing one journal from another. To achieve efficient layout-structure comparison, a left-to-right hidden Markov model (HMM) is used to represent the layout structure of text blocks for each kind of journal. Accordingly, the title of an unknown input journal can be determined effectively by comparing its layout structure to each HMM in the database. Moreover, from the layout structure of the best-matched HMM, we can locate the text block of the issue date, which is then recognized by OCR techniques to accomplish an automatic journal registration system. Experimental results show the feasibility of the proposed approach.
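The layout-matching step can be illustrated with a log-space forward-algorithm score for a left-to-right HMM. This is a hedged sketch: the two-state topology, block-type observations, and every probability below are hypothetical stand-ins, not the paper's trained models:

```python
import math

NEG_INF = -math.inf

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over an iterable of log-values."""
    xs = list(xs)
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_score(obs, log_init, log_trans, log_emit):
    """Log-likelihood of an observation sequence under an HMM
    (forward algorithm, entirely in log space to avoid underflow)."""
    n = len(log_init)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp(alpha[p] + log_trans[p][s] for p in range(n))
                 + log_emit[s][o]
                 for s in range(n)]
    return logsumexp(alpha)

# Hypothetical two-state left-to-right model over two text-block types (0 and 1).
log_init = [0.0, NEG_INF]                      # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0 -> {0, 1}
             [NEG_INF, 0.0]]                   # state 1 -> 1 only (no backward jumps)
log_emit = [[math.log(0.9), math.log(0.1)],    # state 0 mostly emits block type 0
            [math.log(0.1), math.log(0.9)]]    # state 1 mostly emits block type 1
```

In the scheme the abstract describes, a cover page's block-type sequence would be scored against each journal's model in the database and the highest-scoring HMM chosen; the left-to-right constraint (zero probability of backward transitions) encodes the top-to-bottom ordering of cover-page blocks.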
Simple Character Recognition
This work deals with the process of text localization and recognition in an image document. It discusses feature extraction and its usage in machine learning, and describes the design and implementation of a simple application for character recognition of machine-printed text.