44 research outputs found
Automatic detection of change in address blocks for reply forms processing
In this paper, an automatic method is presented for detecting on-line erasures, scribbles, corrections and over-writing in the address block of various types of subscription and utility payment forms. The proposed approach employs bottom-up segmentation of the address block. Heuristic rules based on structural features are used to automate the detection process. The algorithm is applied to a large dataset of 5,780 real-world document forms scanned at 200 dots per inch. The proposed algorithm performs well, with an average processing time of 108 milliseconds per document and a detection accuracy of 98.96%.
Maintenance of the rock shed slope-retaining structure: a landslide mitigation measure at Simpang Pulai - Blue Valley, Perak
The construction industry is a highly challenging industry, not only in Malaysia but worldwide, encompassing the "3D" scope of dirty, difficult and dangerous work. The industry is also among the largest contributors to GDP, at 7.4 percent in 2016, even though it is among the largest contributors of safety incidents, namely accidents (CIDB, 2017). Therefore, the responsible parties should take the problems faced seriously so that the industry can compete at the international level.
A novel computer vision based method for PDF academic literature structure understanding
The PDF format plays a crucial role in electronic academic publishing, but due to its complicated technical rules, PDF cannot be directly read by machines, which causes considerable inconvenience for research work on academic literature. This poster proposes a computer-vision-based PDF document structure understanding method. The method maps visual objects to text objects in PDF academic papers and obtains the geometric and textual attributes of content objects, supplemented by a heuristic algorithm. The algorithm classifies content objects by type to obtain the physical and logical structure of the PDF document. This method overcomes the shortcomings of other PDF analysis approaches, which require extensive hand-crafted feature construction or large-scale corpus training and have difficulty identifying formulas and tables, and successfully performs structure understanding and full-text extraction over ACM's collections.
Segmentation of Unstructured Newspaper Documents
Document layout analysis is one of the important steps in automated document recognition systems. In document layout analysis, meaningful information is retrieved from document images by identifying, categorizing and labeling the semantics of their text blocks. In this paper, we present a simple top-down approach for document page segmentation. We have tested the proposed method on unstructured documents such as newspapers, which have complex layouts with no fixed structure, as well as multiple titles and multiple columns. In the proposed method, the white gap areas that separate titles, columns of text, lines of text and words within lines are identified to divide the document into segments. The proposed algorithm has been implemented and applied to a large number of Indian newspapers, and the results have been evaluated by the number of blocks detected, taking their correct reading order into account.
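The white-gap idea described in this abstract can be sketched as a horizontal projection-profile cut — a minimal illustration under assumed inputs (a binarized page as a 2-D list and a `min_gap` threshold), not the paper's exact algorithm:

```python
def segment_by_white_gaps(rows, min_gap=5):
    """rows: 2-D binary image as a list of lists (1 = ink, 0 = background).
    Returns (start, end) row intervals of ink bands separated by at least
    min_gap blank rows, found via a horizontal projection profile."""
    profile = [sum(r) for r in rows]   # ink-pixel count per row
    segments, start, last_ink = [], None, None
    for i, ink in enumerate(profile):
        if ink > 0:
            if start is None:          # a new ink band begins
                start = i
            last_ink = i
        elif start is not None and i - last_ink >= min_gap:
            segments.append((start, last_ink + 1))  # gap wide enough: close the band
            start = None
    if start is not None:
        segments.append((start, last_ink + 1))      # close a band that runs to the end
    return segments
```

Applying the same cut along columns (transpose the image) would separate newspaper columns; recursing on the resulting bands yields the top-down segmentation the abstract describes.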
Entropy Computation of Document Images in Run-Length Compressed Domain
Compression of documents, images, audio and video has traditionally been practiced to increase the efficiency of data storage and transfer. However, in order to process or carry out any analytical computation, decompression has become an unavoidable prerequisite. In this research work, we attempt to compute entropy, an important document analytic, directly from compressed documents. We use the Conventional Entropy Quantifier (CEQ) and Spatial Entropy Quantifier (SEQ) for the entropy computations [1]. The entropies obtained are useful in applications such as establishing equivalence, word spotting and document retrieval. Experiments have been performed with all the datasets of [1], at character, word and line levels, taking compressed documents in the run-length compressed domain. The algorithms developed are computation- and space-efficient, and the results obtained match 100% with the results reported in [1]. Comment: Published in IEEE Proceedings, 2014 Fifth International Conference on Signals and Image Processing
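The core idea — computing an analytic without ever reconstructing the bitmap — can be sketched by taking Shannon entropy over the run-length symbols themselves. This is an illustration only; the paper's CEQ and SEQ quantifiers follow the formulations of its reference [1], which this sketch does not reproduce:

```python
import math
from collections import Counter

def entropy_from_runs(runs):
    """Shannon entropy (bits) of a run-length symbol stream, computed
    directly from the compressed representation -- the pixel bitmap is
    never decompressed."""
    counts = Counter(runs)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A run-length encoded binary scanline: alternating white/black run lengths.
runs = [7, 3, 7, 3, 2, 8]
```

Because the statistic is gathered from run lengths rather than pixels, both time and memory stay proportional to the compressed size, which is the efficiency claim the abstract makes.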
Predicting semantic labels of text regions in heterogeneous document images
KONVENS 2019: 15th Conference on Natural Language Processing, Erlangen, Germany, October 9-11, 2019 (Open Access)
Document analysis of PDF files: methods, results and implications
A strategy for document analysis is presented which uses Portable Document Format (PDF, the underlying file format for Adobe Acrobat software) as its starting point. This strategy examines the appearance and geometric position of text and image blocks distributed over an entire document. A blackboard system is used to tag the blocks as a first stage in deducing the fundamental relationships between them. PDF is shown to be a useful intermediate stage in the bottom-up analysis of document structure. Its information on line spacing and font usage gives important clues for bridging the semantic gap between the scanned bitmap page and its fully analysed, block-structured form. Analysis of PDF can yield not only accurate page decomposition but also sufficient document information for the later stages of structural analysis and document understanding.
Identification of Technical Journals by Image Processing Techniques
The emphasis of this study is on developing an automatic approach to identifying a given unknown technical journal from its cover page. Since journal cover pages contain a great deal of information, determining the title of an unknown journal using optical character recognition techniques alone is difficult. Comparing the layout structures of text blocks on journal cover pages is an effective method for distinguishing one journal from another. To achieve efficient layout-structure comparison, a left-to-right hidden Markov model (HMM) is used to represent the layout structure of text blocks for each kind of journal. Accordingly, the title of an unknown input journal can be determined effectively by comparing its layout structure to each HMM in the database. Moreover, from the layout structure of the best-matched HMM, we can locate the text block of the issue date, which is then recognized by OCR techniques to accomplish an automatic journal registration system. Experimental results show the feasibility of the proposed approach.
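The layout-matching step can be illustrated with a log-space forward-algorithm score for a left-to-right HMM. This is a hedged sketch: the two-state topology, block-type observations, and every probability below are hypothetical stand-ins, not the paper's trained models:

```python
import math

NEG_INF = -math.inf

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over an iterable of log-values."""
    xs = list(xs)
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_log_score(obs, log_init, log_trans, log_emit):
    """Log-likelihood of an observation sequence under an HMM
    (forward algorithm, entirely in log space to avoid underflow)."""
    n = len(log_init)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [logsumexp(alpha[p] + log_trans[p][s] for p in range(n))
                 + log_emit[s][o]
                 for s in range(n)]
    return logsumexp(alpha)

# Hypothetical two-state left-to-right model over two text-block types (0 and 1).
log_init = [0.0, NEG_INF]                      # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],   # state 0 -> {0, 1}
             [NEG_INF, 0.0]]                   # state 1 -> 1 only (no backward jumps)
log_emit = [[math.log(0.9), math.log(0.1)],    # state 0 mostly emits block type 0
            [math.log(0.1), math.log(0.9)]]    # state 1 mostly emits block type 1
```

In the scheme the abstract describes, a cover page's block-type sequence would be scored against each journal's model in the database and the highest-scoring HMM chosen; the left-to-right constraint (zero probability of backward transitions) encodes the top-to-bottom ordering of cover-page blocks.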
Simple Character Recognition
This work deals with the process of text localization and recognition in an image document. It discusses feature extraction and its usage in machine learning, and describes the design and implementation of a simple application for character recognition of machine-printed text.