7 research outputs found

    Finding Similarities between Structured Documents as a Crucial Stage for Generic Structured Document Classifier

    One of the problems addressed in classifying structured documents is the definition of a similarity measure that is applicable in real situations, where query documents are allowed to differ from the database templates. Test documents may be rotated [1], corrupted by noise [2], or manually edited [3], which makes direct comparison a crucial issue. Another problem is that a huge number of forms may be written in different languages; in Malaysia, for example, forms may be written in Malay, Chinese, English, and other languages. In that case, text recognition (such as OCR) cannot be applied to classify the requested documents, even though OCR is generally considered easier and more accurate than layout detection. Keywords: Feature Extraction, Document Processing, Document Classification
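One OCR-free route the abstract alludes to is comparing documents by layout alone. A minimal sketch, under assumed conventions (blocks given as normalised `(x, y, w, h)` rectangles, similarity as cosine over a coarse grid-occupancy signature — the function names and grid size are illustrative, not the paper's method):

```python
from math import sqrt

def layout_signature(blocks, grid=4, page_w=1.0, page_h=1.0):
    """Map a list of block rectangles (x, y, w, h) onto a coarse
    grid-occupancy vector, ignoring text content entirely."""
    sig = [0.0] * (grid * grid)
    for x, y, w, h in blocks:
        cx = min(int(grid * (x + w / 2) / page_w), grid - 1)
        cy = min(int(grid * (y + h / 2) / page_h), grid - 1)
        sig[cy * grid + cx] += w * h  # weight each cell by block area
    return sig

def cosine_similarity(a, b):
    """Cosine similarity between two layout signatures."""
    dot = sum(p * q for p, q in zip(a, b))
    na, nb = sqrt(sum(p * p for p in a)), sqrt(sum(q * q for q in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because the signature is built purely from block geometry, two filled-in copies of the same form score high even when their text differs in language.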

    Optimization of the Gaussian Kernel Extended by Binary Morphology for Text Line Segmentation

    In this paper, an approach to text line segmentation with an algorithm based on the Gaussian kernel is presented. The algorithm exploits the growing area around text for text line segmentation. To improve the segmentation process, the isotropic Gaussian kernel is extended by dilation. Algorithms with isotropic and extended Gaussian kernels are examined and evaluated on different text samples, and a comparative analysis of the results is given. From the results obtained, an optimization of the parameters defining the extended Gaussian kernel dimensions is proposed. The algorithm with the extended Gaussian kernel showed robustness across different types of text samples
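The two ingredients named in the abstract — an isotropic Gaussian kernel and its extension by morphological dilation — can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the kernel size, sigma, and horizontal structuring element are assumptions:

```python
import math

def gaussian_kernel(size, sigma):
    """Isotropic 2-D Gaussian kernel, normalised to sum to 1."""
    c = size // 2
    k = [[math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2 * sigma ** 2))
          for x in range(size)] for y in range(size)]
    s = sum(sum(row) for row in k)
    return [[v / s for v in row] for row in k]

def dilate_horizontally(kernel, reach):
    """Extend the kernel by grey-scale dilation with a horizontal
    structuring element: each cell takes the maximum over the
    +-reach neighbouring columns, stretching the growing area
    along the text line direction."""
    size = len(kernel)
    out = []
    for row in kernel:
        out.append([max(row[max(0, x - reach):min(size, x + reach + 1)])
                    for x in range(size)])
    return out
```

Convolving a binary text image with the dilated kernel produces elongated blobs that merge characters on the same line while keeping adjacent lines separate, which is the effect the segmentation step relies on.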

    A Skew Detection Technique Suitable for Degraded Ancient Manuscripts


    Recognition and identification of form document layouts

    In this thesis, a hierarchical tree representation is introduced to represent the logical structure of a form document. Because different forms may share the same logical structure, this representation alone is ambiguous; an improvement is proposed that resolves the ambiguity using the physical information of the blocks. To build the hierarchical tree representation and extract the physical information of the blocks, a pixel tracing approach is used to extract layout structures from form documents. Compared with the Hough transform, the pixel tracing algorithm requires less computation. The algorithm has been tested on 50 different table forms: it effectively extracts all the line information required for the hierarchical tree representation, represents each form by a hierarchical tree, and distinguishes the different forms. The algorithm applies to table form documents
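The computational advantage claimed over the Hough transform comes from pixel tracing being a single linear pass over the image. A minimal sketch of the idea for horizontal rule lines (the function name, input convention of a row-major binary image with 1 = ink, and the run-length threshold are assumptions, not the thesis's exact algorithm):

```python
def trace_horizontal_lines(image, min_run):
    """Pixel-tracing sketch: for each row of a binary image,
    find the longest horizontal run of ink pixels; rows whose
    run reaches min_run are reported as candidate form rule
    lines, as (row, start_column, run_length) tuples."""
    lines = []
    for y, row in enumerate(image):
        run = best = start = best_start = 0
        for x, px in enumerate(row):
            if px:
                if run == 0:
                    start = x
                run += 1
                if run > best:
                    best, best_start = run, start
            else:
                run = 0
        if best >= min_run:
            lines.append((y, best_start, best))
    return lines
```

A symmetric pass over columns yields the vertical rules; the detected line segments are then what the hierarchical tree representation is built from. Unlike the Hough transform, no per-pixel voting over an angle range is needed, which is where the computational saving comes from.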

    Improving Digital Library Support for Historic Newspaper Collections

    DVD-ROM appendix available with the print copy of this thesis. National and international initiatives are underway around the globe to digitise vast treasure troves of historical artefacts and make them available as digital libraries (DLs). These DLs are often constructed from facsimile pages with pre-existing metadata, such as historic newspapers stored on microfiche, or generated from the non-destructive scanning of precious manuscripts. Access to the source documents is therefore limited to methods built on that metadata. Other projects introduce full-text indexing by applying off-the-shelf commercial Optical Character Recognition (OCR) software. While this has greater potential for the end-user experience than the metadata-only versions, the approach currently taken is best-effort in the time available rather than a process informed by detailed analysis of the issues. In this thesis, we investigate whether a richer level of support and service can be achieved by more closely integrating image processing techniques with DL software. The thesis presents a variety of experiments, implemented within the recently published open-source OCR system Ocropus. In particular, existing segmentation algorithms are compared against our own, based on the Hough transform, using a corpus we gathered from several major online digital historic newspaper archives
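For readers unfamiliar with the Hough transform the segmentation work is built on, the core mechanism is a voting scheme: each ink pixel votes for every line (theta, rho) it could lie on, and strong lines emerge as accumulator peaks. A minimal dependency-free sketch under assumed quantisation choices (one-degree angle bins, integer rho bins — not the thesis's tuned parameters):

```python
import math

def hough_peak(points, width, height, n_theta=180):
    """Minimal Hough-transform sketch: vote each ink pixel (x, y)
    into a (theta, rho) accumulator and return the strongest line.
    Line model: rho = x*cos(theta) + y*sin(theta)."""
    diag = int(math.hypot(width, height)) + 1  # rho offset so bins stay >= 0
    acc = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = int(round(x * math.cos(theta) + y * math.sin(theta))) + diag
            acc[(t, rho)] = acc.get((t, rho), 0) + 1
    (t, rho), votes = max(acc.items(), key=lambda kv: kv[1])
    return math.pi * t / n_theta, rho - diag, votes
```

On a newspaper page, the dominant peaks correspond to column separators and rules, which is what makes the transform attractive for page segmentation despite its cost relative to simpler tracing schemes.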

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel highly-compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and innovatively apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities
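The reduction described above — casting the structured labelling of blocks in reading order as a Markov chain decoded with Viterbi (1967) — can be illustrated with a generic max-sum Viterbi sketch. This is a standard textbook formulation under assumed conventions (additive scores, `obs_scores[i][s]` for block i taking label s, `trans[s][s2]` for consecutive blocks), not the thesis's 2D model:

```python
def viterbi(obs_scores, trans):
    """Viterbi decoding sketch for a chain of blocks: returns the
    label sequence maximising the sum of per-block observation
    scores and pairwise transition scores between consecutive
    blocks in reading order."""
    n_states = len(obs_scores[0])
    best = list(obs_scores[0])   # best score for each label at block 0
    back = []                    # backpointers for path recovery
    for scores in obs_scores[1:]:
        ptr, nxt = [], []
        for s2 in range(n_states):
            # best previous label to transition into s2
            s = max(range(n_states), key=lambda s: best[s] + trans[s][s2])
            ptr.append(s)
            nxt.append(best[s] + trans[s][s2] + scores[s2])
        best, back = nxt, back + [ptr]
    path = [max(range(n_states), key=lambda s: best[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With labels such as "starts a new article" vs. "continues the previous article", the transition scores encode the reading-order dependencies the abstract refers to, and decoding recovers the globally best segmentation rather than labelling each block in isolation.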

    Contribution to the Analysis and Coding of Image Arrays for Three-Dimensional Imaging

    Three-dimensional (3D) imaging systems are today the principal means of observation for a range of specialised applications, and with the evolution of their technological parameters and of network infrastructures they are expected, in the near future, to become the main imaging method for an even larger range of everyday applications. The research carried out in this dissertation is an advanced study of a particular 3D imaging method called Integral Photography (IP). In the first part of the study, the capabilities of the method were examined and a prototype digital system was developed for capturing IP images of real objects in the near field of the device using a flatbed scanner, capable of producing images of particularly high resolution compared with previously proposed digital systems. In the second part of the research, an automatic system was developed, for the first time, for aligning the sensors with the optical parts of the system; it requires no prior knowledge of the system's characteristics and employs a range of image analysis and pattern recognition techniques. The research concludes with the development of specialised coding algorithms for IP images, which succeed in greatly reducing the inherent redundancy these images contain