7 research outputs found

    Finding Similarities between Structured Documents as a Crucial Stage for Generic Structured Document Classifier

    One of the problems addressed in classifying structured documents is the definition of a similarity measure that is applicable in real situations, where query documents are allowed to differ from the database templates. Test documents may be rotated [1], corrupted by noise [2], or manually edited [3], which makes direct comparison a crucial issue. Another problem is that a huge number of forms may be written in different languages; in Malaysia, for example, forms may be written in Malay, Chinese, English, and other languages. In that case, text recognition (such as OCR) cannot be applied to classify the requested documents, even though OCR is generally considered easier and more accurate than layout detection. Keywords: Feature Extraction, Document Processing, Document Classification
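One OCR-free route the abstract alludes to is comparing documents by layout alone. A minimal sketch, under assumed conventions (blocks given as normalised `(x, y, w, h)` rectangles, similarity as cosine over a coarse grid-occupancy signature — the function names and grid size are illustrative, not the paper's method):

```python
from math import sqrt

def layout_signature(blocks, grid=4, page_w=1.0, page_h=1.0):
    """Map a list of block rectangles (x, y, w, h) onto a coarse
    grid-occupancy vector, ignoring text content entirely."""
    sig = [0.0] * (grid * grid)
    for x, y, w, h in blocks:
        cx = min(int(grid * (x + w / 2) / page_w), grid - 1)
        cy = min(int(grid * (y + h / 2) / page_h), grid - 1)
        sig[cy * grid + cx] += w * h  # weight each cell by block area
    return sig

def cosine_similarity(a, b):
    """Cosine similarity between two layout signatures."""
    dot = sum(p * q for p, q in zip(a, b))
    na, nb = sqrt(sum(p * p for p in a)), sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because the signature is built purely from block geometry, two filled-in copies of the same form score high even when their text differs in language.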

    Optimization of the Gaussian Kernel Extended by Binary Morphology for Text Line Segmentation

    In this paper, an approach to text line segmentation with an algorithm based on the Gaussian kernel is presented. The algorithm exploits the growing area around text for text line segmentation. To improve the segmentation process, the isotropic Gaussian kernel is extended by dilation. Algorithms with isotropic and extended Gaussian kernels are examined and evaluated on different text samples, and a comparative analysis of the results is given. From the results obtained, an optimization of the parameters defining the extended Gaussian kernel dimensions is proposed. The algorithm with the extended Gaussian kernel showed robustness across different types of text samples
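The two ingredients named in the abstract — an isotropic Gaussian kernel and its extension by morphological dilation — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the kernel size, sigma, and horizontal structuring element are assumptions:

```python
import math

def gaussian_kernel(size, sigma):
    """Isotropic 2-D Gaussian kernel, normalised to sum to 1."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def dilate_horizontally(kernel, reach):
    """Extend the kernel by grey-scale dilation with a horizontal
    structuring element: each cell takes the maximum over the
    +-reach neighbouring columns, stretching the growing area
    along the text line direction."""
    size = len(kernel)
    out = []
    for row in kernel:
        out.append([max(row[max(0, x - reach):min(size, x + reach + 1)])
                    for x in range(size)])
    return out
```

Convolving a binary text image with the dilated kernel produces elongated blobs that merge characters on the same line while keeping adjacent lines separate, which is the effect the segmentation step relies on.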

    A Skew Detection Technique Suitable for Degraded Ancient Manuscripts


    Recognition and identification of form document layouts

    In this thesis, a hierarchical tree representation is introduced to represent the logical structure of a form document. Because different forms may share the same logical structure, this representation alone is ambiguous; an improvement is proposed that resolves the ambiguity using the physical information of the blocks. To build the hierarchical tree representation and extract the physical information of the blocks, a pixel tracing approach is used to extract layout structures from form documents. Compared with the Hough transform, the pixel tracing algorithm requires less computation. The algorithm has been tested on 50 different table forms: it effectively extracts all the line information required for the hierarchical tree representation, represents each form by a hierarchical tree, and distinguishes the different forms. The algorithm applies to table form documents
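The computational advantage claimed over the Hough transform comes from pixel tracing being a single linear pass over the image. A minimal sketch of the idea for horizontal rule lines (the function name, input convention of a row-major binary image with 1 = ink, and the run-length threshold are assumptions, not the thesis's exact algorithm):

```python
def trace_horizontal_lines(image, min_run):
    """Pixel-tracing sketch: for each row of a binary image,
    find the longest horizontal run of ink pixels; rows whose
    run reaches min_run are reported as candidate form rule
    lines, as (row, start_column, run_length) tuples."""
    lines = []
    for y, row in enumerate(image):
        run = best = start = best_start = 0
        for x, px in enumerate(row):
            if px:
                if run == 0:
                    start = x
                run += 1
                if run > best:
                    best, best_start = run, start
            else:
                run = 0
        if best >= min_run:
            lines.append((y, best_start, best))
    return lines
```

A symmetric pass over columns yields the vertical rules; the detected line segments are then what the hierarchical tree representation is built from. Unlike the Hough transform, no per-pixel voting over an angle range is needed, which is where the computational saving comes from.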

    Improving Digital Library Support for Historic Newspaper Collections

    DVD-ROM appendix available with the print copy of this thesis. National and international initiatives are underway around the globe to digitise vast treasure troves of historical artefacts and make them available as digital libraries (DLs). These DLs are often constructed from facsimile pages with pre-existing metadata, such as historic newspapers stored on microfiche, or generated from the non-destructive scanning of precious manuscripts. Access to the source documents is therefore limited to methods built on that metadata. Other projects introduce full-text indexing by applying off-the-shelf commercial Optical Character Recognition (OCR) software. While this has greater potential for the end-user experience than the metadata-only versions, the approach currently taken is best-effort in the time available rather than a process informed by detailed analysis of the issues. In this thesis, we investigate whether a richer level of support and service can be achieved by more closely integrating image processing techniques with DL software. The thesis presents a variety of experiments, implemented within the recently published open-source OCR system Ocropus. In particular, existing segmentation algorithms are compared against our own, based on the Hough transform, using a corpus we gathered from several major online digital historic newspaper archives
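For readers unfamiliar with the Hough transform the segmentation work is built on, the core mechanism is a voting scheme: each ink pixel votes for every line (theta, rho) it could lie on, and strong lines emerge as accumulator peaks. A minimal dependency-free sketch under assumed quantisation choices (one-degree angle bins, integer rho bins — not the thesis's tuned parameters):

```python
import math

def hough_peak(points, width, height, n_theta=180):
    """Minimal Hough-transform sketch: vote each ink pixel (x, y)
    into a (theta, rho) accumulator and return the strongest line.
    Line model: rho = x*cos(theta) + y*sin(theta)."""
    diag = int(math.hypot(width, height)) + 1  # rho offset so bins stay >= 0
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta))) + diag
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    (t, rho), votes = max(acc.items(), key=lambda kv: kv[1])
    return math.pi * t / n_theta, rho - diag, votes
```

On a newspaper page, the dominant peaks correspond to column separators and rules, which is what makes the transform attractive for page segmentation despite its cost relative to simpler tracing schemes.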

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities
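The reduction described above — casting the structured labelling of blocks in reading order as a Markov chain decoded with Viterbi (1967) — can be illustrated with a generic max-sum Viterbi sketch. This is a standard textbook formulation under assumed conventions (additive scores, `obs_scores[i][s]` for block i taking label s, `trans[s][s2]` for consecutive blocks), not the thesis's 2D model:

```python
def viterbi(obs_scores, trans):
    """Viterbi decoding sketch for a chain of blocks: returns the
    label sequence maximising the sum of per-block observation
    scores and pairwise transition scores between consecutive
    blocks in reading order."""
    n_states = len(obs_scores[0])
    best = list(obs_scores[0])   # best score for each label at block 0
    back = []                    # backpointers for path recovery
    for scores in obs_scores[1:]:
        ptr, nxt = [], []
        for s2 in range(n_states):
            # best previous label to transition into s2
            s = max(range(n_states), key=lambda s: best[s] + trans[s][s2])
            ptr.append(s)
            nxt.append(best[s] + trans[s][s2] + scores[s2])
        best, back = nxt, back + [ptr]
    path = [max(range(n_states), key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With labels such as "starts a new article" vs. "continues the previous article", the transition scores encode the reading-order dependencies the abstract refers to, and decoding recovers the globally best segmentation rather than labelling each block in isolation.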

    Contribution to the Analysis and Coding of Image Arrays for Three-Dimensional Imaging

    Three-dimensional (3D) imaging systems are today the principal means of observation for a range of specialised applications, and with the evolution of their technological parameters and of network infrastructures they are expected, in the near future, to become the main imaging method for an even larger range of everyday applications. The research carried out in this dissertation is an advanced study of a particular 3D imaging method called Integral Photography (IP). In the first part of the study, the capabilities of the method were examined and a prototype digital system was developed for capturing IP images of real objects in the near field of the device using a flatbed scanner, capable of producing images of particularly high resolution compared with previously proposed digital systems. In the second part of the research, an automatic system was developed, for the first time, for aligning the sensors with the optical parts of the system; it requires no prior knowledge of the system's characteristics and employs a range of image analysis and pattern recognition techniques. The research concludes with the development of specialised coding algorithms for IP images, which succeed in greatly reducing the inherent redundancy these images contain