4 research outputs found

    Document highlighting - message classification in printed business letters

    This paper presents the INFOCLAS system, which applies statistical methods of information retrieval primarily to classify German business letters into message types such as order, offer, and confirmation. INFOCLAS is a first step towards document understanding. It is composed of three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types), and the focuser (highlighting of relevant letter parts). The system employs several knowledge sources, including a database of about 100 letters, word frequency statistics for German, message-type-specific words, morphological knowledge, and the underlying document model. As output, the system computes a set of weighted hypotheses about the type of the letter at hand, or highlights relevant text (text focus). Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
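The indexer and classifier described in this abstract can be illustrated with a minimal sketch. It is not the INFOCLAS implementation: the TF-IDF style weighting, the English toy word lists, the stopword set, and the names `index_terms` and `classify` are all assumptions made for illustration, and the morphological analysis mentioned in the abstract is omitted.

```python
import math
from collections import Counter

# Hypothetical stopword list and message-type word lists (not from the paper).
STOPWORDS = {"we", "you", "your", "that", "the", "a", "of", "for", "to", "and"}
TYPE_WORDS = {
    "order": {"order", "deliver", "quantity"},
    "offer": {"offer", "price", "quote"},
    "confirmation": {"confirm", "received", "acknowledge"},
}

def index_terms(text, doc_freq, n_docs):
    """Indexer: extract terms and weight them (TF-IDF style, an assumption)."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return {
        term: tf * (1.0 + math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))))
        for term, tf in counts.items()
    }

def classify(weights):
    """Classifier: return (message type, score) hypotheses, best first."""
    scores = {
        msg_type: sum(w for term, w in weights.items() if term in words)
        for msg_type, words in TYPE_WORDS.items()
    }
    total = sum(scores.values())
    if total > 0:
        # Normalize so the hypothesis weights sum to 1.
        scores = {k: v / total for k, v in scores.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage: doc_freq stands in for statistics from a letter database.
doc_freq = {"order": 40}
weights = index_terms("We confirm that we received your order.", doc_freq, n_docs=100)
hypotheses = classify(weights)
```

Returning every message type with a score, rather than a single label, mirrors the "set of weighted hypotheses" the abstract describes.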

    Automatische, Deskriptor-basierte Unterstützung der Dokumentanalyse zur Fokussierung und Klassifizierung von Geschäftsbriefen

    This work was carried out within the ALV project (Automatisches Lesen und Verstehen, "automatic reading and understanding") at the German Research Center for Artificial Intelligence (Deutsches Forschungszentrum für Künstliche Intelligenz, DFKI). The goal of the ALV project is to develop an intelligent interface between paper and computer (paper-computer interface). By imitating human reading behavior, the project aims to take a step towards the paperless office. ALV studies business letters as an exemplary domain. The subareas of the ALV project are layout extraction, logical labeling, text recognition, and text analysis. This work belongs to text analysis. The task was to determine, from the words occurring in the letter body, the type of the letter and first indications of the letter author's intention. Other experts can use such information for further processing, distribution, and archiving of the letters. The INFOCLAS system, developed and implemented as part of a diploma thesis, therefore attempts to provide the following functionality on the basis of statistical procedures and methods from information retrieval: i) extraction and weighting of content-bearing words; ii) determination of the core statement (focus) of a business letter; iii) classification of a business letter into predefined message types. The modules developed for this purpose (indexer, focuser, and classifier) use, in addition to concepts from information retrieval, a database containing a collection of business letters as well as specific word lists that represent the modeled letter classes. A morphological tool for the grammatical analysis of words serves as a further aid. With these knowledge sources, hypotheses about the letter class and the core statement of the letter content are formed.
    This documentation compares existing techniques of information retrieval (IR) and evaluates them for their application in document analysis and understanding. Moreover, we have developed a system called INFOCLAS, which uses appropriate statistical methods of IR primarily to classify German business letters into message types such as order, offer, confirmation, inquiry, and advertisement. INFOCLAS is a first step towards understanding of business letters. It comprises three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types), and the focuser (highlighting of relevant parts of the letter). INFOCLAS integrates several knowledge sources, including a database of about 120 letters, word frequency statistics for German, message-type-specific words, morphological knowledge, and the underlying document model (layout and logical structure). As output, the system computes a set of weighted hypotheses about the type of the letter at hand. Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
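The focusing step (finding the core statement of a letter) might look roughly like the sketch below. It assumes that term weights are already available from an indexer and that the focus is simply the sentence with the highest summed weight. The abstract does not specify the scoring rule, so this rule, the function name `focus`, and the sample weights are hypothetical.

```python
import re

def focus(text, term_weights, top_k=1):
    """Return the top_k sentences with the highest summed term weight."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(term_weights.get(w, 0.0) for w in words)

    return sorted(sentences, key=score, reverse=True)[:top_k]

# Usage with hypothetical weights for an "order" letter.
letter = (
    "Thank you for your visit last week. "
    "We hereby order 20 printers at the quoted price. "
    "Kind regards."
)
term_weights = {"order": 2.0, "price": 1.5, "printers": 1.0}
highlighted = focus(letter, term_weights)
```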

    Automatic office document classification and information extraction

    TEXPROS (TEXt PROcessing System) is a document processing system (DPS) that supports office workers in their daily information and document management. This thesis investigates document classification and information extraction, two of the major functional capabilities of TEXPROS. Based on the nature of its content, a document is divided into structured and unstructured (i.e., free-text) parts. The conceptual and content structures are introduced to capture the semantics of the structured and unstructured parts of the document, respectively. The document is classified and information is extracted based on analyses of the conceptual and content structures. In our approach, the layout structure of a document is used to assist these analyses. By nested segmentation of a document, its layout structure is represented as an ordered labeled tree called the Layout Structure Tree (L-S-Tree). A sample-based classification mechanism is adopted for classifying the documents. A set of pre-classified documents is stored in a document sample base in the form of sample trees. In the layout analysis, approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarity between the document and each sample document is evaluated based on the edit distance between the L-S-Tree of the document and the sample tree. The document samples whose layout structure is similar to that of the document are chosen for the conceptual analysis. In the conceptual analysis, based on the mapping between the document and the document samples found during the layout analysis, the conceptual similarity between the document and each sample document is evaluated based on the degree of conceptual closeness. The document sample whose conceptual structure is similar to that of the document is chosen for extracting information. Extraction from the structured part of the document is based on the layout locations of key terms appearing in the document and on string pattern matching. The type of the document is identified from the information extracted from the structured part. In the content analysis, bottom-up and top-down analyses of the free text are combined to extract information from the unstructured part of the document. In the bottom-up analysis, the sentences of the free text are classified as relevant or irrelevant to the extraction. This sentence classification is based on the semantic relationship between the phrases in the sentences and the attribute names in the corresponding content structure, determined by consulting a thesaurus. The thematic roles of the phrases in each relevant sentence are then identified by syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified based on the document type found in the conceptual analysis. Information is then extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure against the result of the bottom-up analysis. The information extracted from the structured and unstructured parts of the document is stored in the database as a frame-like structure (frame instance) for information retrieval in TEXPROS.
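The layout analysis step, matching a document's L-S-Tree against sample trees by edit distance, can be sketched as follows. This is a textbook unit-cost edit distance for ordered labeled trees, written as a memoized forest recursion. It is not necessarily the approximate tree matching algorithm used in TEXPROS, and the tree encoding, function names, and sample trees are assumptions.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples keep everything hashable for caching.

def forest_size(forest):
    """Number of nodes in a forest."""
    return sum(1 + forest_size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Unit-cost edit distance (insert, delete, relabel) between ordered forests."""
    if not f1:
        return forest_size(f2)  # insert every node of f2
    if not f2:
        return forest_size(f1)  # delete every node of f1
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree of f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(f1[:-1], f2[:-1])
             + forest_distance(kids1, kids2)
             + (0 if label1 == label2 else 1))
    return min(delete, insert, match)

def tree_distance(t1, t2):
    return forest_distance((t1,), (t2,))

def closest_sample(doc_tree, samples):
    """Name of the sample tree with the smallest edit distance to doc_tree."""
    return min(samples, key=lambda name: tree_distance(doc_tree, samples[name]))

# Usage with hypothetical layout trees.
samples = {
    "letter": ("page", (("header", ()), ("body", (("para", ()), ("para", ()))), ("footer", ()))),
    "memo": ("page", (("title", ()), ("body", (("para", ()),)))),
}
doc = ("page", (("header", ()), ("body", (("para", ()),)), ("footer", ())))
best = closest_sample(doc, samples)
```

The plain recursion is adequate for small layout trees. For large trees, a dedicated algorithm such as Zhang-Shasha would be the usual choice.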

    Knowledge based document classification supporting integrated document handling

    No full text