    Coherence Identification of Business Documents: Towards an Automated Message Processing System

    This paper describes our recent efforts in developing a text segmentation technique for our business document management system. The document analysis is based upon a knowledge-based analysis of the documents’ contents, automating the coherence identification process without requiring a full semantic understanding. In this technique, document boundaries are identified by observing the shifts of segments from one cluster to another. Our experimental results show that the combination of heterogeneous knowledge sources is capable of addressing topic shifts. Given the increasing recognition of document structure in the fields of information retrieval as well as knowledge management, this approach provides a quantitative model and automatic classification of documents in a business document management system. This will be beneficial for the distribution of documents or the automatic launching of business processes in a workflow management system.
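
    The abstract does not spell out the implementation, but the core idea of placing boundaries where segments shift between clusters can be sketched as follows. The TF-IDF representation and KMeans clustering below are illustrative assumptions, not the paper's knowledge-based analysis.

```python
# Minimal sketch: detect topic shifts by clustering sentence vectors and
# marking a boundary wherever consecutive sentences fall into different clusters.
# TF-IDF + KMeans are illustrative stand-ins, not the paper's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def segment_by_cluster_shift(sentences, n_topics=2):
    vectors = TfidfVectorizer().fit_transform(sentences)
    labels = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit_predict(vectors)
    # A segment boundary is placed wherever the cluster label changes.
    boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return labels, boundaries

sentences = [
    "Please find enclosed our order for ten units.",
    "Delivery is expected within two weeks.",
    "Regarding the unpaid invoice, a reminder was sent last month.",
    "Payment is now overdue by thirty days.",
]
labels, boundaries = segment_by_cluster_shift(sentences)
print(labels, boundaries)
```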

    Adaptive Methods for Robust Document Image Understanding

    A vast amount of digital document material is continuously being produced as part of major digitization efforts around the world. In this context, generic and efficient automatic solutions for document image understanding are a pressing necessity. We propose a generic framework for document image understanding systems, usable for practically any document type available in digital form. Following the introduced workflow, we shift our attention to each of the following processing stages in turn: quality assurance, image enhancement, color reduction and binarization, skew and orientation detection, page segmentation, and logical layout analysis. We review the state of the art in each area, identify current deficiencies, point out promising directions, and give specific guidelines for future investigation. We address some of the identified issues by means of novel algorithmic solutions, with special focus on generality, computational efficiency, and the exploitation of all available sources of information. More specifically, we introduce the following original methods: fully automatic detection of color reference targets in digitized material, accurate foreground extraction from color historical documents, font enhancement for hot metal typeset prints, a theoretically optimal solution to the document binarization problem from both the computational complexity and the threshold selection points of view, layout-independent skew and orientation detection, a robust and versatile page segmentation method, a semi-automatic front page detection algorithm, and a complete framework for article segmentation in periodical publications. The proposed methods are experimentally evaluated on large datasets consisting of real-life heterogeneous document scans. The obtained results show that a document understanding system combining these modules is able to robustly process a wide variety of documents with good overall accuracy.
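
    Of the stages listed, binarization is the easiest to illustrate compactly. The sketch below uses Otsu's classical between-class-variance threshold as a stand-in baseline; it is not the thesis' proposed optimal method.

```python
# Illustrative sketch of the binarization stage using Otsu's global threshold
# (a standard baseline, not the thesis' proposed algorithm).
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance for a uint8 image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

gray = np.random.randint(0, 256, (64, 64), dtype=np.uint8)  # placeholder scan
binary = (gray >= otsu_threshold(gray)).astype(np.uint8)     # 1 = pixels above threshold
```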

    Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference

    Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed: single document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper is an overview of the results achieved in the different types of summarization tasks. We compare both the broader classes of baselines, systems, and humans, as well as individual pairs of summarizers (both human and automatic). An analysis of variance model was fitted, with summarizer and input set as independent variables and the coverage score as the dependent variable, and simulation-based multiple comparisons were performed. The results document the progress in the field as a whole, rather than focusing on a single system, and thus can serve as a future reference on the work done to date, as well as a starting point in the formulation of future tasks. Results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is that of producing a focused summary in answer to a question or topic.
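
    The statistical analysis described (coverage modeled on summarizer and input set, followed by multiple comparisons) can be sketched as below. The column names, the input file, and the use of statsmodels with a Tukey HSD comparison are assumptions standing in for the paper's simulation-based procedure.

```python
# Sketch of the described analysis: coverage modeled as a function of
# summarizer and input set, followed by pairwise comparisons.
# File name, column names, and the Tukey HSD step are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("duc_scores.csv")  # hypothetical table: summarizer, input_set, coverage
model = ols("coverage ~ C(summarizer) + C(input_set)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))                        # variance attributable to each factor
print(pairwise_tukeyhsd(df["coverage"], df["summarizer"]))    # pairwise comparison of summarizers
```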

    Using IR techniques for text classification in document analysis

    This paper presents the INFOCLAS system, which applies statistical methods of information retrieval to the classification of German business letters into corresponding message types such as order, offer, enclosure, etc. INFOCLAS is a first step towards the understanding of documents, proceeding to a classification-driven extraction of information. The system is composed of two main modules: the central indexer (extraction and weighting of indexing terms) and the classifier (classification of business letters into given types). The system employs several knowledge sources, including a letter database, word frequency statistics for German, lists of message type specific words, morphological knowledge, as well as the underlying document structure. As output, the system produces a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
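
    INFOCLAS itself combines hand-built knowledge sources, but the indexer/classifier split it describes has a familiar statistical analogue: term weighting followed by a classifier that yields weighted hypotheses over message types. The sketch below uses TF-IDF and a Naive Bayes model purely for illustration; the sample letters are invented.

```python
# Rough analogue of the indexer/classifier split: TF-IDF term weighting
# followed by a simple classifier over message types.
# The model choice and tiny training set are illustrative, not INFOCLAS components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

letters = [
    "Wir bestellen hiermit 20 Stueck Artikel Nr. 4711.",      # order
    "Anbei erhalten Sie unser Angebot fuer die Lieferung.",   # offer
    "In der Anlage finden Sie die gewuenschten Unterlagen.",  # enclosure
]
types = ["order", "offer", "enclosure"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(letters, types)
# Class probabilities play the role of weighted hypotheses about the letter type.
print(pipeline.predict_proba(["Hiermit bestellen wir 5 Stueck."]))
```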

    Document highlighting - message classification in printed business letters

    This paper presents the INFOCLAS system, which applies statistical methods of information retrieval primarily to the classification of German business letters into corresponding message types such as order, offer, confirmation, etc. INFOCLAS is a first step towards the understanding of documents. It is currently composed of three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types), and the focuser (highlighting relevant letter parts). The system employs several knowledge sources, including a database of about 100 letters, word frequency statistics for German, message type specific words, morphological knowledge, as well as the underlying document model. As output, the system produces a set of weighted hypotheses about the type of the letter at hand, or highlights relevant text (text focus), respectively. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
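
    The focuser's highlighting of relevant letter parts can be imitated by scoring sentences against a message-type specific word list, one of the knowledge sources mentioned above. The word list, sample text, and threshold in this sketch are placeholders, not INFOCLAS resources.

```python
# Toy version of the focuser: score sentences by overlap with a
# message-type specific word list and return the best-scoring ones.
# Word list, threshold, and sample text are placeholders.
import re

ORDER_WORDS = {"bestellen", "bestellung", "liefern", "stueck", "artikel"}

def focus(letter_text, type_words, threshold=1):
    sentences = re.split(r"(?<=[.!?])\s+", letter_text)
    highlighted = []
    for s in sentences:
        tokens = set(re.findall(r"\w+", s.lower()))
        if len(tokens & type_words) >= threshold:
            highlighted.append(s)  # sentence contains type-specific vocabulary
    return highlighted

text = ("Sehr geehrte Damen und Herren. Wir bestellen 10 Stueck Artikel 4711. "
        "Bitte liefern Sie bis Ende des Monats.")
print(focus(text, ORDER_WORDS))
```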

    Analysis of methods

    Information is one of an organization's most important assets. For this reason, the development and maintenance of an integrated information system environment is one of the most important functions within a large organization. The Integrated Information Systems Evolution Environment (IISEE) project has as one of its primary goals a computerized solution to the difficulties involved in the development of integrated information systems. To develop such an environment, a thorough understanding of the enterprise's information needs and requirements is of paramount importance. This document is the current release of the research performed by the Integrated Development Support Environment (IDSE) Research Team in support of the IISEE project. Research indicates that an integral part of any information system environment would be multiple modeling methods to support the management of the organization's information. Automated tool support for these methods is necessary to facilitate their use in an integrated environment. An integrated environment makes it necessary to maintain an integrated database which contains the different kinds of models developed under the various methodologies. In addition, to speed the process of model development, a procedure or technique is needed to allow automatic translation from one methodology's representation to another while maintaining the integrity of both. The purpose of the analysis of the modeling methods included in this document is to examine these methods with the goal of including them in an integrated development support environment. To accomplish this, and to develop a method for allowing intra-methodology and inter-methodology model element reuse, a thorough understanding of multiple modeling methodologies is necessary. Currently, the IDSE Research Team is investigating the family of Integrated Computer Aided Manufacturing (ICAM) DEFinition (IDEF) languages IDEF(0), IDEF(1), and IDEF(1x), as well as ENALIM, Entity Relationship, Data Flow Diagrams, and Structure Charts, for inclusion in an integrated development support environment.

    An overview of information extraction techniques for legal document analysis and processing

    In the Indian legal system, different courts publish their legal proceedings every month for future reference by legal experts and the general public. Extensive manual labor and time are required to analyze and process the information stored in these lengthy, complex legal documents. Automatic legal document processing is the solution to overcome the drawbacks of manual processing and will be very helpful to the common man for a better understanding of the legal domain. In this paper, we explore recent advances in the field of legal text processing and provide a comparative analysis of the approaches used for it. In this work, we have divided the approaches into three classes: NLP-based, deep learning-based, and KBP-based approaches. We put special emphasis on the KBP approach, as we strongly believe that this approach can handle the complexities of the legal domain well. We finally discuss some possible future research directions for legal document analysis and processing.
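
    As a purely illustrative example of the simplest, rule-based end of the spectrum surveyed here, a few surface fields can be pulled from a judgment header with regular expressions. The patterns and the sample text below are invented and do not come from any system discussed in the paper.

```python
# Purely illustrative rule-based extraction of surface fields from an
# invented judgment header; not taken from any surveyed system.
import re

text = ("IN THE SUPREME COURT OF INDIA, Civil Appeal No. 1234 of 2019, "
        "decided on 15-07-2020 between ABC Ltd. and XYZ Pvt. Ltd.")

case_no = re.search(r"(Civil|Criminal) Appeal No\. \d+ of \d{4}", text)
decision_date = re.search(r"\d{2}-\d{2}-\d{4}", text)
print(case_no.group(0) if case_no else None,
      decision_date.group(0) if decision_date else None)
```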

    Development of Application Program for Harmonic Analysis

    Increased power quality problems due to intensive use of power electronic devices have resulted in the development of software applications to perform quick harmonic analysis. However, present harmonic analysis applications have special software or computer lock requirements, occupy large amounts of memory, and are costly. An application program (using Microsoft Visual C++) that is simple yet accurate in its calculations, with no special software or high memory requirements, is developed in this thesis work. The program uses the automatic acceptance criteria (AAC) and harmonic penetration techniques in calculating the system voltages. Several user-friendly features and tools that aid in a better understanding of system harmonics are included in the program. Comparison of case study results with Superharm simulation results demonstrates the program's accuracy. This thesis work resulted in an informative and time-saving program with which the user can document the study results and analyze them with minimum effort.
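
    The harmonic penetration calculation referred to above amounts to solving V_h = Z_h * I_h at each harmonic order and then evaluating the resulting distortion. The sketch below shows that arithmetic in Python rather than the thesis' Visual C++, with made-up per-unit impedance and injection values.

```python
# Sketch of the harmonic penetration idea: at each harmonic order h the bus
# voltage is obtained from the harmonic impedance and injected harmonic
# current (V_h = Z_h * I_h), then total harmonic distortion is evaluated.
# Impedance and injection values are made-up per-unit placeholders.
import numpy as np

fundamental_voltage = 1.0                      # per unit
injections = {5: 0.20, 7: 0.14, 11: 0.09}      # harmonic currents, hypothetical
base_impedance = 0.05 + 0.10j                  # fundamental-frequency impedance, hypothetical

voltages = {}
for h, i_h in injections.items():
    z_h = base_impedance.real + 1j * base_impedance.imag * h  # reactance scales with order
    voltages[h] = abs(z_h * i_h)

thd = np.sqrt(sum(v ** 2 for v in voltages.values())) / fundamental_voltage
print({h: round(v, 4) for h, v in voltages.items()}, f"THD = {thd:.2%}")
```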