
    Document highlighting - message classification in printed business letters

    This paper presents the INFOCLAS system, which applies statistical methods of information retrieval primarily to the classification of German business letters into corresponding message types such as order, offer, confirmation, etc. INFOCLAS is a first step toward the understanding of documents. It is currently composed of three modules: the central indexer (extraction and weighting of indexing terms), the classifier (classification of business letters into given types), and the focuser (highlighting relevant letter parts). The system employs several knowledge sources, including a database of about 100 letters, word frequency statistics for German, message-type-specific words, morphological knowledge, and the underlying document model. As output, the system either delivers a set of weighted hypotheses about the type of the letter at hand or highlights relevant text (text focus). Classification of documents allows the automatic distribution or archiving of letters and is also an excellent starting point for higher-level document analysis.
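    A minimal sketch of the kind of statistical, cue-word-based message-type classification described above, assuming hypothetical German cue words, general-language frequencies, and function names (none of these come from the paper):

```python
# Illustrative sketch only: cue words, frequencies, and weighting are assumptions.
import re
from collections import Counter

# Hypothetical message-type-specific cue words for German business letters.
TYPE_CUES = {
    "order":        {"bestellung", "bestellen", "liefern", "menge"},
    "offer":        {"angebot", "anbieten", "preisliste", "rabatt"},
    "confirmation": {"bestaetigung", "bestaetigen", "auftrag", "eingang"},
}

# Hypothetical general-language word frequencies; rarer words weigh more.
GENERAL_FREQ = {"bestellung": 0.001, "angebot": 0.002, "auftrag": 0.003}

def index_terms(text):
    """Central indexer: extract terms and weight them against general frequency."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    counts = Counter(tokens)
    return {t: c / GENERAL_FREQ.get(t, 0.01) for t, c in counts.items()}

def classify(text):
    """Classifier: return weighted hypotheses about the message type."""
    weights = index_terms(text)
    scores = {
        mtype: sum(w for term, w in weights.items() if term in cues)
        for mtype, cues in TYPE_CUES.items()
    }
    total = sum(scores.values()) or 1.0
    return sorted(((m, s / total) for m, s in scores.items()), key=lambda x: -x[1])

print(classify("Hiermit bestellen wir die folgende Menge zum Preis laut Liste"))
```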

    NATURAL LANGUAGE DOCUMENTS: INDEXING AND RETRIEVAL IN AN INFORMATION SYSTEM

    A steadily increasing number of natural language (NL) documents are handled in information systems. Most of these documents contain some formatted data, which we call strong database data, plus some unformatted data, i.e., free text. The task of a modern information system is to characterize such unformatted (text) data automatically and, in doing so, to support the user in storing and retrieving natural language documents. The retrieval of natural language documents is a fuzzy process, because the user will formulate fuzzy queries unless they use strong search keys. Retrieval of natural language documents can be facilitated with natural language queries, that is, with searches based on natural language text comparisons.
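    A minimal sketch of retrieval by natural language text comparison, here approximated with cosine similarity over raw term-frequency vectors; the sample corpus and all names are illustrative assumptions, not the system described above:

```python
# Illustrative sketch: rank documents by cosine similarity to a free-text query.
import math
import re
from collections import Counter

def vectorize(text):
    """Turn free text into a term-frequency vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

documents = {
    "doc1": "Invoice for office supplies delivered in May",
    "doc2": "Meeting minutes of the quarterly planning session",
}

def retrieve(query, top_k=5):
    """Return the documents most similar to the natural language query."""
    qv = vectorize(query)
    ranked = sorted(documents.items(),
                    key=lambda kv: cosine(qv, vectorize(kv[1])),
                    reverse=True)
    return ranked[:top_k]

print(retrieve("which documents mention invoices for supplies?"))
```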

    HypIR: Hypertext Based Information Retrieval

    Information Retrieval (IR), which is also known as text or document retrieval, is the process of locating and retrieving documents that are relevant to the user queries. In hypertext environments, document databases are organized as a network of nodes which are interconnected by various types of links. This study introduces a hypertext-based text retrieval system, HypIR. In HypIR, the semantic relationships among documents are obtained using a clustering algorithm. A new approach providing the advantages of system maps and history list is introduced to prevent the user from being lost in the IR hyperspace. The paper presents the underlying concepts and implementation details. HypIR is based on the object-oriented paradigm and its execution platform is HyperCard.
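    A minimal sketch of deriving hypertext links among documents from term overlap, standing in for the clustering step described above; the Jaccard threshold, single-pass linking, and tiny corpus are illustrative assumptions, not the HypIR algorithm:

```python
# Illustrative sketch: connect documents whose term overlap exceeds a threshold.
import re
from collections import Counter

def terms(text):
    return Counter(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    inter = len(set(a) & set(b))
    union = len(set(a) | set(b))
    return inter / union if union else 0.0

documents = {
    "d1": "hypertext systems organize nodes and links",
    "d2": "information retrieval locates relevant documents",
    "d3": "retrieval of documents relevant to user queries",
}

THRESHOLD = 0.2   # assumed similarity cutoff for creating a link
links = {d: [] for d in documents}
for a in documents:
    for b in documents:
        if a < b and jaccard(terms(documents[a]), terms(documents[b])) >= THRESHOLD:
            links[a].append(b)
            links[b].append(a)

print(links)   # adjacency list of the derived document network
```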

    Dynamic Signature File Partitioning Based on Term Characteristics

    Signature files act as a filter on retrieval to discard a large number of non-qualifying data items. Linear hashing with superimposed signatures (LHSS) provides an effective retrieval filter for processing queries in dynamic databases. This study analyzes the effects of reflecting term query and occurrence characteristics in the signatures of LHSS. This approach relaxes the unrealistic uniform frequency assumption and lets the terms with high discriminatory power set more bits in the signatures. Simulation experiments based on the derived formulas show that incorporating term characteristics in LHSS improves retrieval efficiency. The paper also discusses the further benefits of this approach in alleviating the potential imbalance between the levels of efficiency and relevancy.
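    A minimal sketch of superimposed signature generation in which terms with higher discriminatory power set more bits; the hash scheme, signature width, and bit-count rule are illustrative assumptions, not the paper's exact formulation:

```python
# Illustrative sketch: superimposed signatures with power-dependent bit counts.
import hashlib

SIGNATURE_BITS = 64   # assumed signature width

def term_bits(term, n_bits):
    """Derive n_bits deterministic bit positions for a term."""
    positions = set()
    i = 0
    while len(positions) < n_bits:
        h = hashlib.md5(f"{term}:{i}".encode()).hexdigest()
        positions.add(int(h, 16) % SIGNATURE_BITS)
        i += 1
    return positions

def record_signature(terms_with_power):
    """Superimpose per-term signatures; high-power terms contribute more bits."""
    sig = 0
    for term, power in terms_with_power.items():
        n_bits = 2 + min(4, int(power))   # assumed rule: more power, more bits
        for pos in term_bits(term, n_bits):
            sig |= 1 << pos
    return sig

def may_contain(record_sig, query_sig):
    """Signature filter: a record qualifies only if all query bits are set."""
    return record_sig & query_sig == query_sig

rec = record_signature({"signature": 3.0, "hashing": 1.0, "files": 0.5})
qry = record_signature({"signature": 3.0})
print(may_contain(rec, qry))   # True: the record passes the filter for this query
```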

    The use of abstraction concepts for representing and structuring documents

    Due to the number of documents available in modern offices, it is necessary to provide a multitude of methods for structuring knowledge, i.e., abstraction concepts. In order to achieve their uniform representation, such concepts should be considered in an integrated fashion to allow concise descriptions free of redundancy. In this paper, we present our approach toward an integration of methods of knowledge structuring. For this purpose, our view of abstraction concepts is briefly introduced using examples from the document world and compared with some existing systems. The main focus of this paper is to show the applicability of an integration of these abstraction concepts, as well as their built-in reasoning facilities, in supporting document processing and management.

    Signature File Hashing Using Term Occurrence and Query Frequencies

    Signature files act as a filter on retrieval to discard a large number of non-qualifying data items. Linear hashing with superimposed signatures (LHSS) provides an effective retrieval filter for processing queries in dynamic databases. This study analyzes the effects of reflecting term occurrence and query frequencies in the signatures of LHSS. This approach relaxes the unrealistic uniform frequency assumption and lets the terms with high discriminatory power set more bits in the signatures. Simulation experiments based on the derived formulas explore the amount of page savings obtained with different occurrence and query frequency combinations at different hashing levels. The results show that the performance of LHSS improves with the hashing level, and that the larger the difference between the discriminatory power values of the terms, the higher the retrieval efficiency. The paper also discusses the benefits of this approach in alleviating the imbalance between the levels of efficiency and relevancy that arises under the unrealistic uniform frequency assumption.
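    A minimal sketch of the linear-hashing side of LHSS, showing why query signatures that set more of the addressing bits probe fewer buckets; the fixed hashing level and the addressing scheme are simplifying assumptions rather than the paper's derivation:

```python
# Illustrative sketch: bucket addressing and candidate-bucket selection for
# superimposed signatures under a fixed hashing level.
LEVEL = 4                  # assumed number of signature bits used for addressing
N_BUCKETS = 1 << LEVEL

def bucket_address(signature):
    """Address a record by the low-order LEVEL bits of its signature."""
    return signature & (N_BUCKETS - 1)

def candidate_buckets(query_signature):
    """Buckets that can possibly hold qualifying records.
    A matching record superimposes all query bits, so its bucket address must
    have a 1 wherever the query's low-order bits have a 1."""
    q = query_signature & (N_BUCKETS - 1)
    return [b for b in range(N_BUCKETS) if b & q == q]

print(bucket_address(0b1011_0110))        # 6: low-order bits pick the bucket
# Queries that set more addressing bits probe fewer buckets, which is why
# letting high-discriminatory-power terms set more bits improves filtering:
print(len(candidate_buckets(0b0000)))     # 16 buckets: no addressing bits set
print(len(candidate_buckets(0b1011)))     # 2 buckets: three addressing bits set
```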

    Signature-based Tree for Finding Frequent Itemsets

    The efficiency of a data mining process depends on the data structure used to find frequent itemsets. Two approaches are possible: use the original transaction dataset, or transform it into another, more compact structure. Many algorithms use trees as the compact structure, like FP-Tree and the associated algorithm FP-Growth. Although this structure reduces the number of scans to only two, its efficiency depends on two criteria: (i) the size of the support (small or large); (ii) the type of transaction dataset (sparse or dense). These two criteria can lead to very large trees. In this paper, we propose a new tree-based structure that emphasizes transactions rather than itemsets. Hence, we avoid the problem of support values having a negative impact on the generated tree.
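    A minimal baseline for frequent itemset mining by direct candidate counting over the transaction dataset; this is not the tree-based structure proposed in the paper, only a reference point for what any such structure must compute:

```python
# Illustrative baseline: count candidate itemsets directly from transactions.
from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
MIN_SUPPORT = 2   # absolute support threshold (assumed for the example)

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return all itemsets up to max_size that occur in >= min_support transactions."""
    frequent = {}
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
        level = {iset: c for iset, c in counts.items() if c >= min_support}
        if not level:       # no frequent itemsets of this size; larger ones impossible
            break
        frequent.update(level)
    return frequent

print(frequent_itemsets(transactions, MIN_SUPPORT))
```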

    Effect of Tunable Indexing on Term Distribution and Cluster-based Information Retrieval Performance

    The purpose of this study is to investigate the effect of tunable indexing on the structure and information retrieval performance of a clustered document database. The generation of all cluster structures and the calculation of term discrimination values are based upon the Cover Coefficient-Based Clustering Methodology. Information retrieval performance is measured in terms of precision, recall, and E-measure. The relationship between term generality and term discrimination value is quantified using the Pearson Rank Correlation Coefficient Test. The effect of tunable indexing on index term distribution and on the number of target clusters is examined.
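    A minimal sketch of the evaluation measures named above (precision, recall, and the E-measure in van Rijsbergen's form); the sample retrieved and relevant sets are illustrative only:

```python
# Illustrative sketch: precision, recall, and E-measure for one query.
def precision_recall_e(retrieved, relevant, beta=1.0):
    """E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R); lower E is better."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    if p == 0.0 and r == 0.0:
        e = 1.0               # worst case: no relevant document retrieved
    else:
        e = 1.0 - (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, e

# Example: 3 documents retrieved, 2 relevant overall, 1 of them retrieved.
print(precision_recall_e(retrieved=["d1", "d2", "d3"], relevant=["d2", "d4"]))
```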