
    Document Classification in Support of Automated Metadata Extraction From Heterogeneous Collections

    A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time-consuming, the task of identifying metadata fields by inspecting the document is easy for a human: the visual cues in the formatting of the document, along with accumulated knowledge and intelligence, make it easy for a human to identify the various metadata fields. Even with the best possible automated procedures, numerous sources of error exist, including some that cannot be controlled, such as scanned documents with text obscured by smudges, signatures, or stamps. A commercially viable process for metadata extraction must remain robust in the presence of these external sources of error as well as in the face of the uncertainty that accompanies any attempt to automate intelligent behavior. While extraction accuracy and completeness must be the primary goal of an extraction system, the ability to detect and report questionable results is equally important for a production-quality system, since it promotes confidence in the system. We have developed and demonstrated a novel system for extracting metadata. First, a document is examined in an attempt to recognize it as an instance of a known document layout. Then a template, a scripted description of how to associate blocks of text in the layout with metadata fields, is applied to the document to extract the metadata. The extraction is validated after post-processing to evaluate its quality and, if necessary, to flag untrusted extractions for human review. The success or failure of the template approach is directly tied to document classification, i.e. the ability to match the document to the proper template correctly and consistently. Document classification in our system is implemented as a module that applies every template available in the system to a document to find candidate templates that extract any data at all. The candidate templates are then evaluated by a validation module to select the best-performing template. This method is called post hoc classification. Post hoc classification is not only effective at selecting the correct class, but also excels at minimizing false positives. It is, however, very sensitive to changes in the template collection and to poorly written templates. While this dissertation examines the evolution and all the major components of an automated metadata extraction system, the primary focus is on the problem of document classification. The main thrust of my research has been investigating alternative methods of document classification to replace or supplement post hoc classification. I experimented with machine learning techniques as an additional input factor for the post hoc classification script or the final validation script.
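
    A minimal sketch of the post hoc classification idea described above, assuming hypothetical Template objects with an extract method and a separate validation function; the names and scoring are illustrative, not the dissertation's actual interfaces:

        # Hypothetical sketch of post hoc classification: apply every available
        # template, keep the ones that extract anything, and let a validation
        # step select the best-performing candidate.
        def post_hoc_classify(document, templates, validate):
            """Return (best_template, metadata), or (None, None) if nothing validates."""
            candidates = []
            for template in templates:
                metadata = template.extract(document)      # scripted block-to-field mapping
                if metadata:                               # keep only templates that extract any data
                    score = validate(document, metadata)   # validation module scores the extraction
                    candidates.append((score, template, metadata))
            if not candidates:
                return None, None                          # flag the document for human review
            score, template, metadata = max(candidates, key=lambda c: c[0])
            return template, metadata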

    Template-Based Metadata Extraction for Heterogeneous Collection

    With the growth of the Internet and related tools, there has been a rapid growth of online resources. In particular, by using high-quality OCR (Optical Character Recognition) tools it has become easy to convert an existing corpus into digital form and make it available online. However, a number of organizations have legacy collections that lack metadata. The lack of metadata hampers not only the discovery and dispersion of these collections over the Web, but also their interoperability with other collections. Unfortunately, manual metadata creation is expensive and time-consuming for a large collection, and most existing automated metadata extraction approaches have focused on specific domains and homogeneous collections. Developing an approach to extract metadata automatically from a large heterogeneous legacy collection poses a number of challenges. In particular, the following issues need to be addressed: (1) Heterogeneity, i.e. how to achieve high accuracy for a heterogeneous collection; (2) Scaling, i.e. how to apply an automated metadata extraction approach to a very large collection; (3) Evolution, i.e. how to process new documents added to a collection over time; (4) Adaptability, i.e. how to apply an approach to a new document collection; (5) Complexity, i.e. how many document features can be handled, and how complex the features should be. In this dissertation, we propose a template-based metadata extraction approach to address these issues. The key idea for addressing heterogeneity is to classify documents into equivalent groups so that each document group contains only similar documents. Next, for each document group we create a template that contains a set of rules instructing a template engine how to extract metadata from the documents in the group. Templates are written in an XML-based language and kept in separate files. Our approach of decoupling rules from program code and representing them in an XML format is easy to adapt to another collection with documents in different styles. We developed our test bed by downloading about 10,000 documents from the DTIC (Defense Technical Information Center) collection, which consists of scanned documents in PDF (Portable Document Format). We have evaluated our approach on this test bed, and our results are encouraging. We have also demonstrated how the extracted metadata can be utilized to integrate our test bed with an interoperable digital library framework based on OAI (Open Archives Initiative).
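
    A minimal sketch, in Python, of the decoupling described above: extraction rules live in an XML template file and a small engine interprets them against a document's text. The field schema, attribute names, and regex-based matching are assumptions made for illustration, not the thesis's actual template language:

        # Hypothetical template engine: rules are read from an XML file rather than
        # being hard-coded, so adapting to a new collection means writing a new
        # template file instead of new program code.
        import re
        import xml.etree.ElementTree as ET

        def apply_template(template_path, lines):
            """Extract metadata from a list of text lines using an XML rule file."""
            rules = ET.parse(template_path).getroot()
            metadata = {}
            for field in rules.findall("field"):
                line_no = int(field.get("line", 0))        # which text line to inspect
                pattern = field.get("pattern", "(.+)")     # regex capturing the field value
                if line_no < len(lines):
                    match = re.search(pattern, lines[line_no])
                    if match:
                        metadata[field.get("name")] = match.group(1).strip()
            return metadata

        # Usage (hypothetical file name): apply_template("dtic_report.xml", page_text.splitlines())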

    Graphics Recognition -- from Re-engineering to Retrieval

    Invited talk at an international peer-reviewed conference with proceedings. In this paper, we discuss how the focus in document analysis, generally speaking, and in graphics recognition more specifically, has moved from re-engineering problems to indexing and information retrieval. After a review of ongoing work on these topics, we propose some challenges for the years to come.

    Automatic Extraction and Assessment of Entities from the Web

    The search for information about entities, such as people or movies, plays an increasingly important role on the Web. This information is still scattered across many Web pages, making it time-consuming for a user to find all relevant information about an entity. This thesis describes techniques to extract entities and information about these entities from the Web, such as facts, opinions, questions and answers, interactive multimedia objects, and events. The findings of this thesis are that it is possible to create a large knowledge base automatically using a manually crafted ontology. The precision of the extracted information was found to be between 75% and 90% (for facts and entities, respectively) after applying assessment algorithms. The algorithms from this thesis can be used to create such a knowledge base, which can be used in various research fields, such as question answering, named entity recognition, and information retrieval.

    Recent Advances in Social Data and Artificial Intelligence 2019

    The importance and usefulness of subjects and topics involving social data and artificial intelligence are becoming widely recognized. This book contains invited review, expository, and original research articles dealing with, and presenting state-of-the-art accounts of, the recent advances in the subjects of social data and artificial intelligence, and potentially their links to Cyberspace.

    Improving Automated Layout Techniques for the Production of Schematic Diagrams

    This thesis explores techniques for the automated production of schematic diagrams, in particular those in the style of metro maps. Metro map style schematics are used across the world, typically to depict public transport networks, and therefore benefit from an innate level of user familiarity not found with most other data visualisation styles. Currently, this style of schematic is used infrequently due to the difficulties involved with creating an effective layout – there are no software tools to aid with the positioning of nodes and other features, resulting in schematics being produced by hand at great expense of time and effort. Automated schematic layout has been an active area of research for the past decade, and part of our work extends upon an effective current technique – multi-criteria hill climbing. We have implemented additional layout criteria and clustering techniques, as well as performance optimisations to improve the final results. Additionally, we ran a series of layouts whilst varying algorithm parameters in an attempt to identify patterns specific to map characteristics. This layout algorithm has been implemented into a custom-written piece of software running on the Android operating system. The software is targeted at tablet devices, using their touch-sensitive screens with a gesture recognition system to allow users to construct complex schematics using sequences of simple gestures. Following on from this, we present our work on a modified force-directed layout method capable of producing fast, high-quality, angular schematic layouts. Our method produces superior results to the previous octilinear force-directed layout method, and is capable of producing results comparable to many of the much slower current approaches. Using our force-directed layout method we then implemented a novel mental map preservation technique which aims to preserve node proximity relations during optimisation; we believe this approach provides a number of benefits over the more common method of preserving absolute node positions. Finally, we performed a user study on our method to test the effect of varying levels of mental map preservation on diagram comprehension.
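
    A minimal sketch of multi-criteria hill climbing in the spirit of the technique extended above: node positions are repeatedly perturbed and a move is kept only when the weighted sum of layout criteria improves. The move set, criteria interface, and weights are illustrative assumptions, not the thesis's implementation:

        # Hypothetical multi-criteria hill climbing over schematic node positions.
        import random

        def hill_climb(positions, edges, criteria, iterations=10000, step=1):
            """positions: dict node -> (x, y); criteria: list of (weight, cost_fn) pairs."""
            def cost(pos):
                # Weighted sum of all layout criteria (e.g. edge length, octilinearity).
                return sum(w * fn(pos, edges) for w, fn in criteria)

            best = cost(positions)
            moves = [(-step, 0), (step, 0), (0, -step), (0, step),
                     (-step, -step), (step, step), (-step, step), (step, -step)]
            for _ in range(iterations):
                node = random.choice(list(positions))
                old = positions[node]
                dx, dy = random.choice(moves)
                positions[node] = (old[0] + dx, old[1] + dy)   # candidate move
                new = cost(positions)
                if new < best:
                    best = new                                 # accept improving move
                else:
                    positions[node] = old                      # reject and revert
            return positions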

    Retrieval by Layout Similarity of Documents Represented with MXY Trees

    Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by means of a tree-based representation: the Modified X-Y tree (MXY tree). Each page in the database is represented by a feature vector containing both global features of the page and a vectorial representation of its layout derived from the corresponding MXY tree. Occurrences of tree patterns are handled similarly to index terms in Information Retrieval in order to compute the similarity. When retrieving relevant documents, the images in the collection are sorted on the basis of a measure that combines two values describing the similarity of global features and of the occurrences of tree patterns. The system is applied to the retrieval of documents belonging to digital libraries. Tests of the system are made on a data set of more than 600 pages from a journal of the 19th century and on a collection of monographs printed in the same century, also containing more than 600 pages.
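
    A minimal sketch of the ranking scheme described above, assuming each page has already been reduced to a dictionary of global layout features and a dictionary of MXY-tree pattern counts; the cosine measure and the weighting parameter are illustrative choices, not necessarily those of the paper:

        # Hypothetical layout-similarity ranking: tree patterns are treated like
        # index terms, and the final score combines global-feature similarity with
        # tree-pattern similarity.
        import math

        def cosine(a, b):
            keys = set(a) | set(b)
            dot = sum(a.get(k, 0) * b.get(k, 0) for k in keys)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def rank_pages(query, database, alpha=0.5):
            """query/database entries: {'global': {...}, 'patterns': {...}} feature dicts."""
            scored = []
            for page_id, page in database.items():
                sim = (alpha * cosine(query["global"], page["global"])
                       + (1 - alpha) * cosine(query["patterns"], page["patterns"]))
                scored.append((sim, page_id))
            return sorted(scored, reverse=True)   # best-matching pages first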