    Signature file access methodologies for text retrieval: a literature review with additional test cases

    Signature files are extremely compressed versions of text files which can be used as access or index files to facilitate searching documents for text strings. These access files, or signatures, are generated by storing hashed codes for individual words. Because the hashing or storing process can generate similar codes for different words, the primary concern in researching signature files is the accuracy of retrieval. Inaccuracy always takes the form of a false drop: the signature signals the presence of a text string that is not actually in the document. Two suggested ways to alter false drop rates are: 1) to determine whether either of the two methodologies for storing hashed codes, superimposing them or concatenating them, is more efficient; and 2) to determine whether a particular hashing algorithm has any impact. To assess these issues, the history of superimposed coding is traced from its development as a tool for compressing information onto punched cards in the 1950s to its incorporation into proposed signature file methodologies in the mid-1980s. Likewise, the concept of compressing individual words by various algorithms, or by hashing them, is traced through the research literature. Following this literature review, benchmark trials are performed using both superimposed and concatenated methodologies while varying hashing algorithms. It is determined that while one combination of hashing algorithm and storage methodology is better, all signature file methods can be considered viable.
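The two storage methodologies named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the reviewed work: the 64-bit signature width, the three bits per word, the 16-bit word codes, the use of SHA-256 as the hash, and all function names.

```python
import hashlib
from typing import List

SIG_BITS = 64       # assumed signature width in bits
BITS_PER_WORD = 3   # assumed number of bit positions derived per word
CODE_BITS = 16      # assumed width of each word code (concatenated scheme)


def _digest(word: str) -> bytes:
    # SHA-256 is a stand-in; the abstract does not name a hashing algorithm.
    return hashlib.sha256(word.lower().encode("utf-8")).digest()


def word_mask(word: str) -> int:
    """Bit mask with up to BITS_PER_WORD bits set, derived from the word's hash."""
    d = _digest(word)
    mask = 0
    for i in range(BITS_PER_WORD):
        pos = int.from_bytes(d[2 * i:2 * i + 2], "big") % SIG_BITS
        mask |= 1 << pos
    return mask


def word_code(word: str) -> int:
    """Fixed-width hashed code for one word."""
    return int.from_bytes(_digest(word)[:4], "big") % (1 << CODE_BITS)


def superimposed_signature(text: str) -> int:
    """OR all word masks together into a single fixed-width signature."""
    sig = 0
    for w in text.split():
        sig |= word_mask(w)
    return sig


def concatenated_signature(text: str) -> List[int]:
    """Store one hashed code per word, in document order."""
    return [word_code(w) for w in text.split()]


def maybe_in_superimposed(sig: int, word: str) -> bool:
    """True if every bit of the word's mask is set; may be a false drop."""
    m = word_mask(word)
    return (sig & m) == m


def maybe_in_concatenated(codes: List[int], word: str) -> bool:
    """True if the word's code is stored; may be a false drop on a collision."""
    return word_code(word) in codes
```

In both schemes a word that is present is always signaled, so errors occur only as false drops: bits set by other words happen to cover the query mask, or two different words share a code.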

    Symbolic and Visual Retrieval of Mathematical Notation using Formula Graph Symbol Pair Matching and Structural Alignment

    Large data collections containing millions of math formulae in different formats are available on-line. Retrieving math expressions from these collections is challenging. We propose a framework for retrieval of mathematical notation using symbol pairs extracted from visual and semantic representations of mathematical expressions on the symbolic domain for retrieval of text documents. We further adapt our model for retrieval of mathematical notation on images and lecture videos. Graph-based representations are used on each modality to describe math formulas. For symbolic formula retrieval, where the structure is known, we use symbol layout trees and operator trees. For image-based formula retrieval, since the structure is unknown, we use a more general Line of Sight graph representation. Paths of these graphs define symbol pair tuples that are used as the entries for our inverted index of mathematical notation. Our retrieval framework uses a three-stage approach: a fast selection of candidates as the first layer; a more detailed matching algorithm with similarity metric computation in the second stage; and, when relevance assessments are available, an optional third layer that uses linear regression to estimate relevance from multiple similarity scores for final re-ranking. Our model has been evaluated using large collections of documents, and preliminary results are presented for videos and cross-modal search. The proposed framework can be adapted for other domains like chemistry or technical diagrams where two visually similar elements from a collection are usually related to each other.
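The idea of indexing symbol pairs taken from tree paths, followed by a fast first-stage candidate selection, can be sketched roughly as follows. This is an illustrative assumption and not the authors' implementation: the tree encoding, the relation labels ("n" for next, "a" for above), the function names, and ranking by the raw count of shared pairs are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical tree encoding: a node is (symbol, [(relation, child_node), ...]).


def descendants(node):
    """Yield (symbol, path) for every descendant of node."""
    _, children = node
    for rel, child in children:
        yield child[0], rel
        for sym, path in descendants(child):
            yield sym, rel + path


def symbol_pairs(node):
    """Set of (ancestor, descendant, path) tuples over the whole tree."""
    sym, children = node
    pairs = {(sym, d_sym, path) for d_sym, path in descendants(node)}
    for _, child in children:
        pairs |= symbol_pairs(child)
    return pairs


def build_index(formulas):
    """Inverted index: symbol pair tuple -> set of formula ids."""
    index = defaultdict(set)
    for fid, tree in formulas.items():
        for pair in symbol_pairs(tree):
            index[pair].add(fid)
    return index


def candidates(index, query_tree):
    """First-stage selection: rank formulas by number of shared pairs."""
    hits = Counter()
    for pair in symbol_pairs(query_tree):
        for fid in index.get(pair, ()):
            hits[fid] += 1
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# x^2 + y, x^2, and a + y as toy layout trees.
formulas = {
    "f1": ("x", [("a", ("2", [])), ("n", ("+", [("n", ("y", []))]))]),
    "f2": ("x", [("a", ("2", []))]),
    "f3": ("a", [("n", ("+", [("n", ("y", []))]))]),
}
index = build_index(formulas)
# Query x^2 + z shares two pairs with f1 and one with f2.
query = ("x", [("a", ("2", [])), ("n", ("+", [("n", ("z", []))]))])
ranked = candidates(index, query)
```

A later stage, such as the detailed matching the abstract describes, would then rescore only the formulas returned by `candidates`.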

    A corpus-based induction learning approach to natural language processing.

    by Leung Chi Hong.
    Thesis (Ph.D.)--Chinese University of Hong Kong, 1996.
    Includes bibliographical references (leaves 163-171).

    Chapter 1. Introduction
    Chapter 2. Background Study of Natural Language Processing
      2.1. Knowledge-based approach
        2.1.1. Morphological analysis
        2.1.2. Syntactic parsing
        2.1.3. Semantic parsing
          2.1.3.1. Semantic grammar
          2.1.3.2. Case grammar
        2.1.4. Problems of knowledge acquisition in knowledge-based approach
      2.2. Corpus-based approach
        2.2.1. Beginning of corpus-based approach
        2.2.2. An example of corpus-based application: word tagging
        2.2.3. Annotated corpus
        2.2.4. State of the art in the corpus-based approach
      2.3. Knowledge-based approach versus corpus-based approach
      2.4. Co-operation between two different approaches
    Chapter 3. Induction Learning applied to Corpus-based Approach
      3.1. General model of traditional corpus-based approach
        3.1.1. Division of a problem into a number of sub-problems
        3.1.2. Solution selected from a set of predefined choices
        3.1.3. Solution selection based on a particular kind of linguistic entity
        3.1.4. Statistical correlations between solutions and linguistic entities
        3.1.5. Prediction of the best solution based on statistical correlations
      3.2. First problem in the corpus-based approach: Irrelevance in the corpus
      3.3. Induction learning
        3.3.1. General issues about induction learning
        3.3.2. Reasons of using induction learning in the corpus-based approach
        3.3.3. General model of corpus-based induction learning approach
          3.3.3.1. Preparation of positive corpus and negative corpus
          3.3.3.2. Statistical correlations between solutions and linguistic entities
          3.3.3.3. Combination of the statistical correlations obtained from the positive and negative corpora
      3.4. Second problem in the corpus-based approach: Modification of initial probabilistic approximations
      3.5. Learning feedback modification
        3.5.1. Determination of which correlation scores to be modified
        3.5.2. Determination of the magnitude of modification
        3.5.3. A general algorithm of learning feedback modification
    Chapter 4. Identification of Phrases and Templates in Domain-specific Chinese Texts
      4.1. Analysis of the problem solved by the traditional corpus-based approach
      4.2. Phrase identification based on positive and negative corpora
      4.3. Phrase identification procedure
        4.3.1. Step 1: Phrase seed identification
        4.3.2. Step 2: Phrase construction from phrase seeds
      4.4. Template identification procedure
      4.5. Experiment and result
        4.5.1. Testing data
        4.5.2. Details of experiments
        4.5.3. Experimental results
          4.5.3.1. Phrases and templates identified in financial news articles
          4.5.3.2. Phrases and templates identified in political news articles
      4.6. Conclusion
    Chapter 5. A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation
      5.1. Background of Chinese word segmentation
      5.2. Typical methods of Chinese word segmentation
        5.2.1. Syntactic and semantic approach
        5.2.2. Statistical approach
        5.2.3. Heuristic approach
      5.3. Problems in word segmentation
        5.3.1. Chinese word definition
        5.3.2. Word dictionary
        5.3.3. Word segmentation ambiguity
      5.4. Corpus-based induction learning approach to improving word segmentation accuracy
        5.4.1. Rationale of approach
        5.4.2. Method of constructing modification rules
      5.5. Experiment and results
      5.6. Characteristics of modification rules constructed in experiment
      5.7. Experiment constructing rules for compound words with suffixes
      5.8. Relationship between modification frequency and Zipf's first law
      5.9. Problems in the approach
      5.10. Conclusion
    Chapter 6. Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms
      6.1. Background of automatic indexing
        6.1.1. Definition of index term and indexing
        6.1.2. Manual indexing versus automatic indexing
        6.1.3. Different approaches to automatic indexing
      6.2. Corpus-based induction learning approach to automatic indexing
        6.2.1. Fundamental concept about corpus-based automatic indexing
        6.2.2. Procedure of automatic indexing
          6.2.2.1. Learning process
          6.2.2.2. Indexing process
      6.3. Experiments of corpus-based induction learning approach to automatic indexing
        6.3.1. An experiment evaluating the complete procedures
          6.3.1.1. Testing data used in the experiment
          6.3.1.2. Details of the experiment
          6.3.1.3. Experimental result
        6.3.2. An experiment comparing with the traditional approach
        6.3.3. An experiment determining the optimal indexing score threshold
        6.3.4. An experiment measuring the precision and recall of indexing performance
      6.4. Learning feedback modification
        6.4.1. Positive feedback
        6.4.2. Negative feedback
        6.4.3. Change of indexed proportions of positive/negative training corpus in feedback iterations
        6.4.4. An experiment evaluating the learning feedback modification
        6.4.5. An experiment testing the significance factor in merging process
      6.5. Conclusion
    Chapter 7. Conclusion
    Appendix A: Some examples of identified phrases in financial news articles
    Appendix B: Some examples of identified templates in financial news articles
    Appendix C: Some examples of texts containing the templates in financial news articles
    Appendix D: Some examples of identified phrases in political news articles
    Appendix E: Some examples of identified templates in political news articles
    Appendix F: Some examples of texts containing the templates in political news articles
    Appendix G: Syntactic tags used in word segmentation modification rule experiment
    Appendix H: An example of semantic approach to automatic indexing
    Appendix I: An example of syntactic approach to automatic indexing
    Appendix J: Samples of INSPEC and MEDLINE Records
    Appendix K: Examples of Promoting and Demoting Words
    References