
    A corpus-based induction learning approach to natural language processing.

    by Leung Chi Hong. Thesis (Ph.D.), Chinese University of Hong Kong, 1996. Includes bibliographical references (leaves 163-171). Table of contents:

    Chapter 1. Introduction
    Chapter 2. Background Study of Natural Language Processing
        2.1. Knowledge-based approach
            2.1.1. Morphological analysis
            2.1.2. Syntactic parsing
            2.1.3. Semantic parsing
                2.1.3.1. Semantic grammar
                2.1.3.2. Case grammar
            2.1.4. Problems of knowledge acquisition in the knowledge-based approach
        2.2. Corpus-based approach
            2.2.1. Beginning of the corpus-based approach
            2.2.2. An example of a corpus-based application: word tagging
            2.2.3. Annotated corpus
            2.2.4. State of the art in the corpus-based approach
        2.3. Knowledge-based approach versus corpus-based approach
        2.4. Co-operation between the two approaches
    Chapter 3. Induction Learning Applied to the Corpus-based Approach
        3.1. General model of the traditional corpus-based approach
            3.1.1. Division of a problem into a number of sub-problems
            3.1.2. Solution selected from a set of predefined choices
            3.1.3. Solution selection based on a particular kind of linguistic entity
            3.1.4. Statistical correlations between solutions and linguistic entities
            3.1.5. Prediction of the best solution based on statistical correlations
        3.2. First problem in the corpus-based approach: irrelevance in the corpus
        3.3. Induction learning
            3.3.1. General issues in induction learning
            3.3.2. Reasons for using induction learning in the corpus-based approach
            3.3.3. General model of the corpus-based induction learning approach
                3.3.3.1. Preparation of positive and negative corpora
                3.3.3.2. Statistical correlations between solutions and linguistic entities
                3.3.3.3. Combination of the statistical correlations obtained from the positive and negative corpora
        3.4. Second problem in the corpus-based approach: modification of initial probabilistic approximations
        3.5. Learning feedback modification
            3.5.1. Determination of which correlation scores to modify
            3.5.2. Determination of the magnitude of modification
            3.5.3. A general algorithm for learning feedback modification
    Chapter 4. Identification of Phrases and Templates in Domain-specific Chinese Texts
        4.1. Analysis of the problem solved by the traditional corpus-based approach
        4.2. Phrase identification based on positive and negative corpora
        4.3. Phrase identification procedure
            4.3.1. Step 1: phrase seed identification
            4.3.2. Step 2: phrase construction from phrase seeds
        4.4. Template identification procedure
        4.5. Experiment and results
            4.5.1. Testing data
            4.5.2. Details of experiments
            4.5.3. Experimental results
                4.5.3.1. Phrases and templates identified in financial news articles
                4.5.3.2. Phrases and templates identified in political news articles
        4.6. Conclusion
    Chapter 5. A Corpus-based Induction Learning Approach to Improving the Accuracy of Chinese Word Segmentation
        5.1. Background of Chinese word segmentation
        5.2. Typical methods of Chinese word segmentation
            5.2.1. Syntactic and semantic approach
            5.2.2. Statistical approach
            5.2.3. Heuristic approach
        5.3. Problems in word segmentation
            5.3.1. Chinese word definition
            5.3.2. Word dictionary
            5.3.3. Word segmentation ambiguity
        5.4. Corpus-based induction learning approach to improving word segmentation accuracy
            5.4.1. Rationale of the approach
            5.4.2. Method of constructing modification rules
        5.5. Experiment and results
        5.6. Characteristics of the modification rules constructed in the experiment
        5.7. Experiment constructing rules for compound words with suffixes
        5.8. Relationship between modification frequency and Zipf's first law
        5.9. Problems in the approach
        5.10. Conclusion
    Chapter 6. Corpus-based Induction Learning Approach to Automatic Indexing of Controlled Index Terms
        6.1. Background of automatic indexing
            6.1.1. Definition of index term and indexing
            6.1.2. Manual indexing versus automatic indexing
            6.1.3. Different approaches to automatic indexing
        6.2. Corpus-based induction learning approach to automatic indexing
            6.2.1. Fundamental concepts of corpus-based automatic indexing
            6.2.2. Procedure of automatic indexing
                6.2.2.1. Learning process
                6.2.2.2. Indexing process
        6.3. Experiments on the corpus-based induction learning approach to automatic indexing
            6.3.1. An experiment evaluating the complete procedure
                6.3.1.1. Testing data used in the experiment
                6.3.1.2. Details of the experiment
                6.3.1.3. Experimental results
            6.3.2. An experiment comparing with the traditional approach
            6.3.3. An experiment determining the optimal indexing score threshold
            6.3.4. An experiment measuring the precision and recall of indexing performance
        6.4. Learning feedback modification
            6.4.1. Positive feedback
            6.4.2. Negative feedback
            6.4.3. Change of indexed proportions of the positive/negative training corpus over feedback iterations
            6.4.4. An experiment evaluating the learning feedback modification
            6.4.5. An experiment testing the significance factor in the merging process
        6.5. Conclusion
    Chapter 7. Conclusion
    Appendix A: Some examples of identified phrases in financial news articles
    Appendix B: Some examples of identified templates in financial news articles
    Appendix C: Some examples of texts containing the templates in financial news articles
    Appendix D: Some examples of identified phrases in political news articles
    Appendix E: Some examples of identified templates in political news articles
    Appendix F: Some examples of texts containing the templates in political news articles
    Appendix G: Syntactic tags used in the word segmentation modification rule experiment
    Appendix H: An example of the semantic approach to automatic indexing
    Appendix I: An example of the syntactic approach to automatic indexing
    Appendix J: Samples of INSPEC and MEDLINE records
    Appendix K: Examples of promoting and demoting words
    References

    Automatic multi-label subject indexing in a multilingual environment

    This paper presents an approach to automatically subject-index full-text documents with multiple labels, based on binary support vector machines (SVMs). The aim was to test the applicability of SVMs on a real-world dataset. We have also explored the feasibility of incorporating multilingual background knowledge, as represented in thesauri or ontologies, into our text document representation for indexing purposes. The test set for our evaluations has been compiled from an extensive document base maintained by the Food and Agriculture Organization (FAO) of the United Nations (UN). Empirical results show that SVMs are a good method for automatic multi-label classification of documents in multiple languages.
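    The one-vs-rest scheme the abstract describes (one independent binary SVM per subject label) can be sketched briefly. The following Python fragment is a minimal illustration using scikit-learn; the toy documents, label sets, and plain TF-IDF features are assumptions for demonstration, not the paper's FAO dataset or its thesaurus-enriched representation.

    ```python
    # Multi-label subject indexing via one binary SVM per label (one-vs-rest).
    # Toy data only; a real system would train on a large annotated corpus.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    docs = [
        "rice irrigation and water management in arid regions",
        "fisheries policy and international aquaculture trade",
        "water supply planning for aquaculture ponds",
    ]
    labels = [{"irrigation", "water"}, {"fisheries", "trade"}, {"water", "fisheries"}]

    binarizer = MultiLabelBinarizer()               # label sets -> binary indicator matrix
    Y = binarizer.fit_transform(labels)
    vectorizer = TfidfVectorizer()                  # bag-of-words document representation
    X = vectorizer.fit_transform(docs)

    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)  # one binary SVM per subject label
    new_doc = vectorizer.transform(["trade rules for fisheries exports"])
    print(binarizer.inverse_transform(clf.predict(new_doc)))
    ```

    Incorporating the multilingual thesaurus or ontology knowledge the paper mentions would then amount to enriching the feature space (e.g., with language-independent concept identifiers) before vectorization.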

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we survey the impact and legal consequences of these technical advances and point out future directions of research.

    Coherent Keyphrase Extraction via Web Mining

    Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents). (Comment: 6 pages; related work available at http://purl.org/peter.turney)
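    The web-mining step can be made concrete with a short sketch. One plausible instantiation is pointwise mutual information estimated from search-engine hit counts; the `hits` helper, the page-count constant, and PMI as the specific association measure are assumptions for illustration, not necessarily the exact statistic the paper uses.

    ```python
    # Association between two candidate keyphrases from web hit counts.
    # `hits()` is a hypothetical placeholder for a real search-engine API.
    import math

    TOTAL_PAGES = 5e9  # assumed number of indexed pages (rough placeholder)

    def hits(query: str) -> int:
        """Hypothetical: number of web pages matching `query`."""
        raise NotImplementedError("plug in a real search API here")

    def association(phrase_a: str, phrase_b: str) -> float:
        """Pointwise mutual information of two phrases, estimated from hit counts."""
        p_a = hits(f'"{phrase_a}"') / TOTAL_PAGES
        p_b = hits(f'"{phrase_b}"') / TOTAL_PAGES
        p_ab = hits(f'"{phrase_a}" "{phrase_b}"') / TOTAL_PAGES  # co-occurrence on a page
        if p_ab == 0.0 or p_a == 0.0 or p_b == 0.0:
            return float("-inf")  # never co-occur: no evidence of relatedness
        return math.log2(p_ab / (p_a * p_b))
    ```

    Under this reading, a candidate whose average association with the rest of the selected set is low would be treated as an outlier, which is the coherence problem the paper targets.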

    Extraction of Keyphrases from Text: Evaluation of Four Algorithms

    This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithm’s keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsoft’s Word 97, (2) an algorithm based on Eric Brill’s part-of-speech tagger, (3) the Summarize feature in Verity’s Search 97, and (4) NRC’s Extractor algorithm. For all five document collections, NRC’s Extractor yields the best match with the manually generated keyphrases.
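    The evaluation criterion described here, the degree to which an algorithm's keyphrases match the hand-generated target set, reduces to set overlap. A minimal sketch follows, under the simplifying assumption of exact matching after case and whitespace normalization; evaluations of this kind often also stem words before comparing.

    ```python
    # Score an extractor by overlap between its keyphrases and the target set.
    def match_scores(extracted: list[str], targets: list[str]) -> tuple[float, float]:
        normalize = lambda p: " ".join(p.lower().split())
        ext = {normalize(p) for p in extracted}
        tgt = {normalize(p) for p in targets}
        matched = ext & tgt
        precision = len(matched) / len(ext) if ext else 0.0  # fraction of output that is correct
        recall = len(matched) / len(tgt) if tgt else 0.0     # fraction of targets recovered
        return precision, recall

    p, r = match_scores(["keyphrase  extraction", "Corpora"], ["corpora", "text mining"])
    print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.50
    ```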

    Learning to Extract Keyphrases from Text

    Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft's Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity's Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97).
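    The supervised framing described here (a document as a set of candidate phrases, each classified as a positive or negative example of a keyphrase) is easy to sketch. The features and toy data below are illustrative assumptions, and scikit-learn's CART decision tree stands in for C4.5; GenEx itself is not reproduced.

    ```python
    # Classify candidate phrases as keyphrases from simple per-phrase features.
    from sklearn.tree import DecisionTreeClassifier

    # Feature vector per candidate phrase: (frequency in document, relative
    # position of first occurrence in [0, 1], number of words in the phrase).
    X = [
        [12, 0.02, 2],  # frequent, appears early: typical keyphrase profile
        [1, 0.90, 1],   # rare, appears late
        [7, 0.10, 3],
        [2, 0.75, 1],
        [9, 0.05, 2],
        [1, 0.60, 4],
    ]
    y = [1, 0, 1, 0, 1, 0]  # 1 = phrase was an author-assigned keyphrase

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(tree.predict([[10, 0.03, 2]]))  # a frequent, early phrase -> likely [1]
    ```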

    An experiment with ontology mapping using concept similarity

    This paper describes a system for automatically mapping between concepts in different ontologies. The motivation for the research stems from the Diogene project, in which the project's own ontology covering the ICT domain is mapped to external ontologies, in order that their associated content can automatically be included in the Diogene system. An approach involving measuring the similarity of concepts is introduced, in which standard Information Retrieval indexing techniques are applied to concept descriptions. A matrix representing the similarity of concepts in two ontologies is generated, and a mapping is performed based on two parameters: the domain coverage of the ontologies, and their levels of granularity. Finally, some initial experimentation is presented which suggests that our approach meets the project's unique set of requirements.
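    The similarity step (standard IR indexing applied to concept descriptions, then a concept-by-concept similarity matrix) can be sketched as follows. The two miniature "ontologies" and the choice of TF-IDF with cosine similarity are assumptions for illustration; the paper's two mapping parameters, domain coverage and granularity, are only noted in a comment.

    ```python
    # Similarity matrix between concepts of two ontologies, built from their
    # textual descriptions indexed with TF-IDF.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    onto_a = {"Programming": "writing and testing computer source code",
              "Networking": "data communication between linked computers"}
    onto_b = {"Coding": "producing source code for computer programs",
              "Routing": "forwarding data packets across computer networks"}

    vectorizer = TfidfVectorizer().fit(list(onto_a.values()) + list(onto_b.values()))
    sim = cosine_similarity(vectorizer.transform(list(onto_a.values())),
                            vectorizer.transform(list(onto_b.values())))

    # sim[i, j] holds the similarity of the i-th concept of A to the j-th of B.
    # A full mapper would choose pairings from this matrix, moderated by the
    # ontologies' domain coverage and levels of granularity.
    for i, name in enumerate(onto_a):
        print(name, "->", list(onto_b)[sim[i].argmax()])
    ```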