86,594 research outputs found

    Searching for Ground Truth: a stepping stone in automating genre classification

    This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in characterising genres, in predicting the obstacles that an automated system must overcome, and in creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers, based on image features and stylistic modelling features, in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.
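
    The usefulness of such a human-labelled testbed depends on how far the three labellers actually agree. The abstract does not name an agreement statistic, so the following is a minimal sketch assuming Fleiss' kappa, with a toy count matrix standing in for the real fifteen-class data.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of label counts.

    counts[i, j] = number of labellers assigning item i to category j;
    every row must sum to the same number of labellers (here, 3).
    """
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]

    # Per-item agreement: fraction of labeller pairs that agree.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))

    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_j ** 2)

    return (p_i.mean() - p_e) / (1 - p_e)

# Toy example: 4 documents, 3 labellers, only 3 genre classes shown.
toy = [[3, 0, 0],   # unanimous
       [2, 1, 0],
       [0, 3, 0],
       [1, 1, 1]]   # complete disagreement
print(round(fleiss_kappa(toy), 3))
```

    Items on which agreement is low are precisely the ones an automated classifier can be expected to find hard, which is what makes the agreed subset a sensible training and evaluation corpus.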

    Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents

    Document image analysis, like any digital image analysis, requires the identification and extraction of suitable features. These are generally extracted from uncompressed images, although in practice images are made available in compressed form for reasons such as transmission and storage efficiency. This implies that a compressed image must first be decompressed, which demands additional computing resources, and it motivates research into extracting features directly from the compressed image. In this research, we propose to extract essential features such as the projection profile, run-histogram and entropy for text document analysis directly from run-length compressed text documents. The experiments show that these features can be extracted directly from the compressed image, without a decompression stage, which reduces computing time. The feature values so extracted are exactly identical to those extracted from uncompressed images.
    Comment: Published by IEEE in Proceedings of ACPR-2013. arXiv admin note: text overlap with arXiv:1403.778
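
    To make the idea concrete, here is a minimal sketch of computing a projection profile and run-histogram straight from run-length data. The RLE layout assumed here (per row, run lengths alternating white/black and starting with white) is an illustrative assumption, not necessarily the paper's exact scheme.

```python
def projection_profile(rle_rows):
    """Black-pixel count per row, read directly off the run lengths."""
    # Black runs occupy the odd positions of each alternating sequence.
    return [sum(runs[1::2]) for runs in rle_rows]

def run_histogram(rle_rows, max_run=16):
    """Histogram of black run lengths, again without decompression."""
    hist = [0] * (max_run + 1)
    for runs in rle_rows:
        for r in runs[1::2]:
            hist[min(r, max_run)] += 1
    return hist

# Two 12-pixel rows: "3 white, 5 black, 4 white" and "6 white, 6 black".
rows = [[3, 5, 4], [6, 6]]
print(projection_profile(rows))   # [5, 6]
print(run_histogram(rows)[:8])    # [0, 0, 0, 0, 0, 1, 1, 0]
```

    Because no pixel array is ever materialised, the cost scales with the number of runs rather than the number of pixels, which is where the savings in computing time come from.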

    COMPARISON OF IMAGE SEGMENTATION METHODS IN IMAGE CHARACTER EXTRACTION PREPROCESSING USING OPTICAL CHARACTER RECOGNITION

    Today, many documents exist as digital images obtained from various sources, and they must be processable by a computer automatically. One such processing task is text extraction using OCR (Optical Character Recognition) technology. However, in many cases OCR technology is unable to read text characters in digital images accurately. This can be due to several factors, such as poor image quality or noise. To obtain accurate results the image must be of good quality, so the digital image needs to be preprocessed. The image preprocessing methods used in this study are the Otsu thresholding binarization, Niblack, and Sauvola methods, while the OCR engine used to extract the characters is the Tesseract library in Python. The test results show that direct text extraction from the original image gives better results, with an average character match rate of 77.27%. By comparison, the match rate using the Otsu thresholding method was 70.27%, the Sauvola method 69.67%, and the Niblack method only 35.72%. However, in some cases in this research the Sauvola and Otsu methods give better results.
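
    As an illustration of the pipeline described above, the sketch below binarises a scanned page with the three methods mentioned and passes each result to Tesseract. It assumes OpenCV, scikit-image and pytesseract are installed; the file name and the window-size/k parameters are placeholders, not the study's settings.

```python
import cv2
import numpy as np
import pytesseract
from skimage.filters import threshold_niblack, threshold_sauvola

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

# Global Otsu threshold.
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Local thresholds; window size and Niblack's k are tuning parameters.
sauvola = ((gray > threshold_sauvola(gray, window_size=25)) * 255).astype(np.uint8)
niblack = ((gray > threshold_niblack(gray, window_size=25, k=0.2)) * 255).astype(np.uint8)

for name, img in [("original", gray), ("otsu", otsu),
                  ("sauvola", sauvola), ("niblack", niblack)]:
    text = pytesseract.image_to_string(img)
    print(name, repr(text[:60]))
```

    Comparing each OCR output against a ground-truth transcription (for example with a character-level edit distance) yields match rates of the kind reported above.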

    Binarisation Algorithms Analysis on Document and Natural Scene Images

    Binarisation plays an important role in systems for text extraction from images, a prominent area in digital image processing. The primary goal of binarisation techniques is to convert colour and greyscale images into black-and-white images so that the overall computational overhead can be minimised, and it has a great impact on the performance of a text-extraction system. Such systems have a number of applications, including navigation for visually impaired persons, automatic text extraction from document images, and number-plate detection for enforcing traffic rules. The present study analysed the performance of well-known binarisation algorithms on degraded documents and camera-captured images. The statistical parameters Precision, Recall, F-measure and PSNR are used to evaluate performance. To gauge the suitability of each binarisation method for preserving text in natural scene images, we have also considered visual observation.
    DOI: 10.17762/ijritcc2321-8169.15083
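
    The pixel-level metrics named above are straightforward to compute against a ground-truth binarisation. Below is a minimal sketch, assuming 0/1 arrays of equal shape with 1 marking text (foreground) pixels; it illustrates the metrics and is not the study's evaluation code.

```python
import numpy as np

def binarisation_scores(pred, truth):
    """Pixel-level precision, recall, F-measure and PSNR for a binarised image."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    mse = np.mean((pred.astype(float) - truth.astype(float)) ** 2)
    psnr = float("inf") if mse == 0 else 10 * np.log10(1.0 / mse)
    return precision, recall, f_measure, psnr

truth = np.zeros((4, 4), dtype=int)
truth[1:3, 1:3] = 1                    # a 2x2 block of text pixels
pred = truth.copy()
pred[1, 1] = 0                         # one missed text pixel
print(binarisation_scores(pred, truth))
```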

    Pattern Spotting and Image Retrieval in Historical Documents using Deep Hashing

    This paper presents a deep learning approach for image retrieval and pattern spotting in digital collections of historical documents. First, a region proposal algorithm detects object candidates in the document page images. Next, deep learning models are used for feature extraction, considering two distinct variants, which provide either real-valued or binary code representations. Finally, candidate images are ranked by computing their feature similarity with a given input query. A robust experimental protocol evaluates the proposed approach for each representation scheme (real-valued and binary code) on the DocExplore image database. The experimental results show that the proposed deep models compare favorably to the state-of-the-art image retrieval approaches for images of historical documents, outperforming other deep models by 2.56 percentage points using the same techniques for pattern spotting. The proposed approach also reduces the search time by up to 200x and the storage cost by up to 6,000x compared to related works based on real-valued representations.
    Comment: 7 pages
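
    The final ranking step is where the binary-code variant pays off: feature similarity reduces to Hamming distance, which needs only bitwise operations on packed codes. The sketch below illustrates the idea with random 64-bit codes; the code length and the packing trick are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(100_000, 64), dtype=np.uint8)  # candidate codes
query = rng.integers(0, 2, size=64, dtype=np.uint8)

# Pack 64 bits into 8 bytes each; Hamming distance is then a popcount
# over the XOR of the packed representations.
packed = np.packbits(codes, axis=1)
packed_q = np.packbits(query)
dist = np.unpackbits(packed ^ packed_q, axis=1).sum(axis=1)

top10 = np.argsort(dist)[:10]          # best-matching document regions
print(top10, dist[top10])
```

    Storing 64 bits per region instead of, say, a high-dimensional float vector is also what drives the large storage savings reported for the binary variant.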

    Detecting Family Resemblance: Automated Genre Classification.

    This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The current paper compares the roles of visual layout, stylistic features and language model features in clustering documents, and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool populated with documents of the nineteen most popular genres found in our experimental data set.

    Automating Metadata Extraction: Genre Classification

    A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF [22]. Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at documents from five directions: as an object with a specific visual format, as a layout of strings with a characteristic grammar, as an object with stylometric signatures, as an object with meaning and purpose, and as an object linked to previously classified objects and external sources. Some results of experiments relating to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
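
    One simple way to realise such a multi-directional method is early fusion: extract a feature vector per direction and concatenate the vectors before classification. The sketch below, which is not the DCC implementation, illustrates this with two synthetic views standing in for the visual-format and stylometric directions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_docs, n_genres = 200, 3
labels = rng.integers(0, n_genres, size=n_docs)

# Stand-ins for per-direction extractors; each view carries some genre signal.
visual = rng.normal(loc=labels[:, None], size=(n_docs, 3))       # "visual format" view
stylo = rng.normal(loc=0.5 * labels[:, None], size=(n_docs, 2))  # "stylometric" view

X = np.hstack([visual, stylo])         # concatenate the views (early fusion)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy:", round(clf.score(X, labels), 2))
```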

    Semantics-Based Content Extraction in Typewritten Historical Documents

    This paper presents a flexible approach to extracting content from scanned historical documents using semantic information. The final electronic document is the result of a "digital historical document lifecycle" process in which the expert knowledge of the historian/archivist user is incorporated at different stages. Results show that such a conversion strategy, aided by (expert) user-specified semantic information and able to process individual parts of the document in a specialised way, produces superior results (in a variety of significant ways) to those of document analysis and understanding techniques devised for contemporary documents.