4 research outputs found

    Review of Indexing Techniques Applied in Information Retrieval

    Indexing is one of the important tasks of Information Retrieval and can be applied to any form of data, whether generated from the web, databases, etc. As the size of corpora increases, indexing becomes too time-consuming and labor-intensive, hence the introduction of computer-aided indexers. This paper reviews indexing techniques, both human and automatic. It outlines the use of automatic indexing by discussing various hashing techniques, including fuzzy fingerprinting and locality-sensitive hashing. Two different matching processes used in automatic subject indexing are also reviewed. Given the need for automatic indexing as a possible replacement for manual indexing, studies in the development of automatic indexing tools must continue.

    Automatic index generation for the free-text based database.

    by Leung Chi Hong. Thesis (M.Phil.)--Chinese University of Hong Kong, 1992. Includes bibliographical references (leaves 183-184).

    Chapter one: Introduction
    Chapter two: Background knowledge and linguistic approaches of automatic indexing
        2.1 Definition of index and indexing
        2.2 Indexing methods and problems
        2.3 Automatic indexing and human indexing
        2.4 Different approaches of automatic indexing
        2.5 Example of semantic approach
        2.6 Example of syntactic approach
        2.7 Comments on semantic and syntactic approaches
    Chapter three: Rationale and methodology of automatic index generation
        3.1 Problems caused by natural language
        3.2 Usage of word frequencies
        3.3 Brief description of rationale
        3.4 Automatic index generation
            3.4.1 Training phase
                3.4.1.1 Selection of training documents
                3.4.1.2 Control and standardization of variants of words
                3.4.1.3 Calculation of associations between words and indexes
                3.4.1.4 Discarding false associations
            3.4.2 Indexing phase
            3.4.3 Example of automatic indexing
        3.5 Related researches
        3.6 Word diversity and its effect on automatic indexing
        3.7 Factors affecting performance of automatic indexing
        3.8 Application of semantic representation
            3.8.1 Problem of natural language
            3.8.2 Use of concept headings
            3.8.3 Example of using concept headings in automatic indexing
            3.8.4 Advantages of concept headings
            3.8.5 Disadvantages of concept headings
        3.9 Correctness prediction for proposed indexes
            3.9.1 Example of using index proposing rate
        3.10 Effect of subject matter on automatic indexing
        3.11 Comparison with other indexing methods
        3.12 Proposal for applying Chinese medical knowledge
    Chapter four: Simulations of automatic index generation
        4.1 Training phase simulations
            4.1.1 Simulation of association calculation (word diversity uncontrolled)
            4.1.2 Simulation of association calculation (word diversity controlled)
            4.1.3 Simulation of discarding false associations
        4.2 Indexing phase simulation
        4.3 Simulation of using concept headings
        4.4 Simulation for testing performance of predicting index correctness
        4.5 Summary
    Chapter five: Real case study in database of Chinese Medicinal Material Research Center
        5.1 Selection of real documents
        5.2 Case study one: Overall performance using real data
            5.2.1 Sample results of automatic indexing for real documents
        5.3 Case study two: Using multi-word terms
        5.4 Case study three: Using concept headings
        5.5 Case study four: Prediction of proposed index correctness
        5.6 Case study five: Use of (Σ ΔRij) Fi to determine false association
        5.7 Case study six: Effect of word diversity
        5.8 Summary
    Chapter six: Conclusion
    Appendix A: List of stopwords
    Appendix B: Index terms used in case studies
    References

    A corpus-based induction learning approach to natural language processing.

    by Leung Chi Hong. Thesis (Ph.D.)--Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 163-171).

    Chapter 1. Introduction
    Chapter 2. Background Study of Natural Language Processing
        2.1. Knowledge-based approach
            2.1.1. Morphological analysis
            2.1.2. Syntactic parsing
            2.1.3. Semantic parsing
                2.1.3.1. Semantic grammar
                2.1.3.2. Case grammar
            2.1.4. Problems of knowledge acquisition in knowledge-based approach
        2.2. Corpus-based approach
            2.2.1. Beginning of corpus-based approach
            2.2.2. An example of corpus-based application: word tagging
            2.2.3. Annotated corpus
            2.2.4. State of the art in the corpus-based approach
        2.3. Knowledge-based approach versus corpus-based approach
        2.4. Co-operation between two different approaches
    Chapter 3. Induction Learning applied to Corpus-based Approach
        3.1. General model of traditional corpus-based approach
            3.1.1. Division of a problem into a number of sub-problems
            3.1.2. Solution selected from a set of predefined choices
            3.1.3. Solution selection based on a particular kind of linguistic entity
            3.1.4. Statistical correlations between solutions and linguistic entities
            3.1.5. Prediction of the best solution based on statistical correlations
        3.2. First problem in the corpus-based approach: Irrelevance in the corpus
        3.3. Induction learning
            3.3.1. General issues about induction learning
            3.3.2. Reasons of using induction learning in the corpus-based approach
            3.3.3. General model of corpus-based induction learning approach
                3.3.3.1. Preparation of positive corpus and negative corpus
                3.3.3.2. Statistical correlations between solutions and linguistic entities
                3.3.3.3. Combination of the statistical correlations obtained from the positive and negative corpora
        3.4. Second problem in the corpus-based approach: Modification of initial probabilistic approximations
        3.5. Learning feedback modification
            3.5.1. Determination of which correlation scores to be modified
            3.5.2. Determination of the magnitude of modification
            3.5.3. A general algorithm of learning feedback modification
    Chapter 4. Identification of Phrases and Templates in Domain-specific Chinese Texts
        4.1. Analysis of the problem solved by the traditional corpus-based approach
        4.2. Phrase identification based on positive and negative corpora
        4.3. Phrase identification procedure
            4.3.1. Step 1: Phrase seed identification
            4.3.2. Step 2: Phrase construction from phrase seeds
        4.4. Template identification procedure
        4.5. Experiment and result
            4.5.1. Testing data
            4.5.2. Details of experiments
            4.5.3. Experimental results
                4.5.3.1. Phrases and templates identified in financial news articles
                4.5.3.2. Phrases and templates identified in political news articles
        4.6. Conclusion
    Chapter 5. A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation
        5.1. Background of Chinese word segmentation
        5.2. Typical methods of Chinese word segmentation
            5.2.1. Syntactic and semantic approach
            5.2.2. Statistical approach
            5.2.3. Heuristic approach
        5.3. Problems in word segmentation
            5.3.1. Chinese word definition
            5.3.2. Word dictionary
            5.3.3. Word segmentation ambiguity
        5.4. Corpus-based induction learning approach to improving word segmentation accuracy
            5.4.1. Rationale of approach
            5.4.2. Method of constructing modification rules
        5.5. Experiment and results
        5.6. Characteristics of modification rules constructed in experiment
        5.7. Experiment constructing rules for compound words with suffixes
        5.8. Relationship between modification frequency and Zipf's first law
        5.9. Problems in the approach
        5.10. Conclusion
    Chapter 6. Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms
        6.1. Background of automatic indexing
            6.1.1. Definition of index term and indexing
            6.1.2. Manual indexing versus automatic indexing
            6.1.3. Different approaches to automatic indexing
        6.2. Corpus-based induction learning approach to automatic indexing
            6.2.1. Fundamental concept about corpus-based automatic indexing
            6.2.2. Procedure of automatic indexing
                6.2.2.1. Learning process
                6.2.2.2. Indexing process
        6.3. Experiments of corpus-based induction learning approach to automatic indexing
            6.3.1. An experiment evaluating the complete procedures
                6.3.1.1. Testing data used in the experiment
                6.3.1.2. Details of the experiment
                6.3.1.3. Experimental result
            6.3.2. An experiment comparing with the traditional approach
            6.3.3. An experiment determining the optimal indexing score threshold
            6.3.4. An experiment measuring the precision and recall of indexing performance
        6.4. Learning feedback modification
            6.4.1. Positive feedback
            6.4.2. Negative feedback
            6.4.3. Change of indexed proportions of positive/negative training corpus in feedback iterations
            6.4.4. An experiment evaluating the learning feedback modification
            6.4.5. An experiment testing the significance factor in merging process
        6.5. Conclusion
    Chapter 7. Conclusion
    Appendix A: Some examples of identified phrases in financial news articles
    Appendix B: Some examples of identified templates in financial news articles
    Appendix C: Some examples of texts containing the templates in financial news articles
    Appendix D: Some examples of identified phrases in political news articles
    Appendix E: Some examples of identified templates in political news articles
    Appendix F: Some examples of texts containing the templates in political news articles
    Appendix G: Syntactic tags used in word segmentation modification rule experiment
    Appendix H: An example of semantic approach to automatic indexing
    Appendix I: An example of syntactic approach to automatic indexing
    Appendix J: Samples of INSPEC and MEDLINE Records
    Appendix K: Examples of Promoting and Demoting Words
    References