
    Term Clustering of Syntactic Phrases

    Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations. In this paper we discuss our implementation of a syntactic phrase generator, as well as our preliminary experiments with producing phrase clusters. These experiments show small improvements in retrieval effectiveness resulting from the use of phrase clusters, but it is clear that corpora much larger than standard information retrieval test collections will be required to evaluate this technique thoroughly.
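A minimal sketch of the general idea, not the authors' system: crude two-word "phrases" stand in for syntactic phrases, each phrase is represented by the documents it occurs in, and phrases with similar occurrence patterns are grouped. The phrase extraction, similarity threshold, and greedy grouping are all illustrative assumptions.

```python
# Sketch: cluster phrase terms by the similarity of their document occurrences.
from collections import defaultdict
import math

def phrases(text):
    """Very rough stand-in for a syntactic phrase generator: adjacent word pairs."""
    words = [w.lower() for w in text.split() if w.isalpha()]
    return [" ".join(pair) for pair in zip(words, words[1:])]

def phrase_doc_vectors(docs):
    """Map each phrase to a {doc_id: count} occurrence vector."""
    vecs = defaultdict(lambda: defaultdict(int))
    for doc_id, text in enumerate(docs):
        for ph in phrases(text):
            vecs[ph][doc_id] += 1
    return vecs

def cosine(u, v):
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def cluster_phrases(vecs, threshold=0.5):
    """Greedy single-link grouping of phrases with similar document distributions."""
    clusters = []  # list of sets of phrases
    for ph, vec in vecs.items():
        for cl in clusters:
            if any(cosine(vec, vecs[other]) >= threshold for other in cl):
                cl.add(ph)
                break
        else:
            clusters.append({ph})
    return clusters

docs = ["term clustering improves retrieval",
        "syntactic phrases improve retrieval quality"]
print(cluster_phrases(phrase_doc_vectors(docs)))
```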

    Logical-Linguistic Model and Experiments in Document Retrieval

    Conventional document retrieval systems have relied heavily on the keyword approach with statistical parameters. It now seems that such an approach has reached its upper limit of retrieval effectiveness, so new approaches should be investigated for the development of future systems. With current advances in hardware, programming languages and techniques, natural language processing and understanding, and the field of artificial intelligence generally, attempts are now being made to include linguistic processing in document retrieval systems. A few attempts have been made to include parsing or syntactic analysis in document retrieval systems, and the reported results show some improvement in retrieval effectiveness. The first part of this thesis investigates the use of linguistic processing further by including translation, rather than only parsing, in a document retrieval system. The translation process is based on unification categorial grammar and uses C-Prolog as the building tool. It forms the main part of the process that indexes documents and queries into a knowledge-base predicate representation. Instead of the vector space model, documents and queries are represented in a kind of knowledge-base model which we call the logical-linguistic model. The development of a robust parser-translator to perform the translation is discussed in detail, including a method of dealing with ambiguity. The retrieval process of this model is based on a logical implication process implemented in C-Prolog. To handle uncertainty in evaluating similarity values between documents and queries, meta-level constructs are built on top of the C-Prolog system. A logical meta-language, called UNIL (UNcertain Implication Language), is proposed for controlling the implication process. Using UNIL, one can write a set of implication rules and a thesaurus to define the matching function of a particular retrieval strategy. We have thus demonstrated and implemented the matching operation between a document and a query as an inference using unification, performed in the context of the global information represented by the implication rules and the thesaurus. A set of well-structured experiments with various retrieval strategies is performed on a test collection of documents and queries to evaluate the performance of the system, and the results obtained are analysed and discussed.
    The second part of the thesis implements and evaluates the imaging retrieval strategy as originally defined by van Rijsbergen. Imaging retrieval is implemented as relevance-feedback retrieval with nearest-neighbour information, defined as follows. One of the best retrieval strategies from the earlier experiments is chosen to produce the initial ranking of the documents, and a few top-ranked documents are retrieved and judged relevant or not by the user. From this set of retrieved and relevant documents, we obtain all unretrieved documents that have any of the retrieved and relevant documents as their nearest neighbour. These unretrieved documents are potentially relevant as well, since they are 'close' to the retrieved and relevant ones, so their initial similarity values to the query are updated according to their distances from their nearest neighbours. From the updated similarity values, a new ranking of documents can be obtained and evaluated. Several sets of experiments with the imaging retrieval strategy are performed with the following objectives: to search for an appropriate updating function for producing the new ranking, to determine an appropriate nearest-neighbour set, to find how retrieval effectiveness relates to the number of documents shown to the user for relevance judgement, and to assess the effectiveness of multi-stage imaging retrieval. The results obtained are analysed and discussed.
    Overall, the thesis defines the logical-linguistic model for document retrieval and demonstrates it by building an experimental system referred to as SILOL (a Simple Logical-linguistic document retrieval system). A set of retrieval strategies is experimented with, and the results obtained are analysed and discussed.
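A hedged sketch of the nearest-neighbour updating step described above. The particular updating function used here (a distance-discounted boost taken from the relevant neighbour's score) is an illustrative assumption, since the thesis experiments with several candidate functions; the data structures are likewise assumptions.

```python
# Sketch of one "imaging" re-ranking pass with nearest-neighbour information.
def imaging_rerank(scores, nearest, distance, judged_relevant, alpha=0.5):
    """
    scores:          {doc_id: initial similarity to the query}
    nearest:         {doc_id: id of its nearest-neighbour document}
    distance:        {doc_id: distance to that nearest neighbour}
    judged_relevant: set of retrieved documents the user marked relevant
    """
    updated = dict(scores)
    for doc, nn in nearest.items():
        if doc not in judged_relevant and nn in judged_relevant:
            # Illustrative update: the closer an unretrieved document is to a
            # retrieved-and-relevant neighbour, the larger the boost it gets.
            updated[doc] = scores[doc] + alpha * scores[nn] / (1.0 + distance[doc])
    return sorted(updated, key=updated.get, reverse=True)

# Toy usage: d1 is judged relevant, d2 has d1 as nearest neighbour and is boosted.
scores = {"d1": 0.9, "d2": 0.2, "d3": 0.1}
nearest = {"d2": "d1", "d3": "d2"}
distance = {"d2": 0.3, "d3": 0.8}
print(imaging_rerank(scores, nearest, distance, judged_relevant={"d1"}))
```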

    Approximate content match of multimedia data with natural language queries.

    Wong Kit-pui. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 117-119). Table of contents:
    Acknowledgment; Abstract; Keywords
    Chapter 1: Introduction
    Chapter 2: Approach (2.1 Challenges; 2.2 Knowledge Representation; 2.3 Proposed Information Model; 2.4 Restricted Language Set)
    Chapter 3: Theory (3.1 Features: Superficial Details, Hidden Details; 3.2 Matching Process: Inexact Match, An Illustration covering Stage 1 Query Parsing, Stage 2 Gross Filtering, Stage 3 Fine Scoring; 3.3 Extending Knowledge: Attributes with Intermediate Closeness, Comparing Different Entities; 3.4 Putting Concepts to Work)
    Chapter 4: Implementation (4.1 Overall Structure; 4.2 Choosing NL Parser; 4.3 Ambiguity; 4.4 Storing Knowledge: Type Hierarchy with Node Name, Node Identity and Operations for Direct and Interactive Edit, Implicit Features, Database of Captions, Explicit Features, Transformation Map)
    Chapter 5: Illustration (5.1 Gloss Tags; 5.2 Parsing: Resolving Nouns and Verbs, Resolving Adjectives and Adverbs, Normalizing Features, Resolving Prepositions; 5.3 Matching: Gross Filtering, Fine Scoring)
    Chapter 6: Discussion (6.1 Performance Measures: General Parameters, Experiments on Inexact and Exact Matching Behaviour; 6.2 Difficulties; 6.3 Possible Improvement; 6.4 Conclusion)
    References; Appendices (A Notation; B Glossary; C Proposed Feature Slots and Value; D Sample Captions and Queries; E Manual Pages; F Directory Structure; G Imported Toolboxes; H Program Listing)

    A graph-based information retrieval model using structural similarities to improve the information retrieval process

    The main objective of IR systems is to select documents relevant to a user's information need from a collection. Traditional approaches to document/query comparison use surface similarity: the comparison engine relies on surface attributes (indexing terms) alone. We propose a new method that uses structural similarities, that is, similarities that exploit both surface attributes and the relations between attributes. These similarities were inspired by cognitive studies and by a general similarity measure based on node comparison in a bipartite graph. We adapt this general method to the specific context of information retrieval, taking into account the specificities of the domain: data types, weighted edges, and the choice of normalization. The core problem is how documents are compared against queries. The idea we develop is that similar documents share similar terms, and similar terms appear in similar documents. We have developed an algorithm that translates this idea, studied its convergence and complexity, and run tests on classical collections, comparing our measure with two reference measures in the domain. We show that these structural similarities bring a beneficial information gain over direct similarities alone. The thesis is structured in five chapters. The first chapter deals with the comparison problem and related concepts such as similarity, explains different points of view, and proposes an analogy between cognitive similarity models and IR models. The second chapter presents the IR task, test collections, and the measures used to evaluate a ranked list of documents. The third chapter is devoted to graph theory: our model is based on a bipartite graph representation, so we define graphs and the criteria used to evaluate them. The fourth chapter describes, step by step, how we adopted and adapted the general comparison method to build our IR model. The fifth chapter describes how we evaluate the ranking performance of our method on different collections and how we compare it with two other approaches.
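A minimal sketch of the mutual-reinforcement idea stated above (similar documents share similar terms, and similar terms occur in similar documents) on a bipartite document-term graph. The row normalisation, fixed iteration count, and diagonal reset are assumptions made for illustration, not the thesis's exact measure.

```python
# Sketch: iterative structural similarity on a bipartite document-term graph.
import numpy as np

def structural_similarities(doc_term, iterations=10):
    """
    doc_term: (n_docs x n_terms) weighted incidence matrix of the bipartite graph.
    Returns (doc_sim, term_sim) after iterative propagation across the graph.
    """
    # Row-normalise so each document / term distributes unit weight over its edges.
    D = doc_term / np.maximum(doc_term.sum(axis=1, keepdims=True), 1e-12)
    T = doc_term.T / np.maximum(doc_term.T.sum(axis=1, keepdims=True), 1e-12)

    doc_sim = np.eye(doc_term.shape[0])
    term_sim = np.eye(doc_term.shape[1])
    for _ in range(iterations):
        # Documents are similar when their terms are similar, and conversely.
        doc_sim = D @ term_sim @ D.T
        term_sim = T @ doc_sim @ T.T
        np.fill_diagonal(doc_sim, 1.0)
        np.fill_diagonal(term_sim, 1.0)
    return doc_sim, term_sim

# Toy usage: 3 documents over 4 terms; a query could be scored by adding it
# as an extra "document" row and reading off its row of doc_sim.
doc_term = np.array([[1, 1, 0, 0],
                     [0, 1, 1, 0],
                     [0, 0, 1, 1]], dtype=float)
doc_sim, term_sim = structural_similarities(doc_term)
print(np.round(doc_sim, 2))
```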