
    Combining and selecting characteristics of information use

    In this paper we report on a series of experiments designed to investigate the combination of term and document weighting functions in Information Retrieval. We describe a series of weighting functions, each of which is based on how information is used within documents and collections, and use these weighting functions in two types of experiments: one based on combination of evidence for ad-hoc retrieval, the other based on selective combination of evidence within a relevance feedback situation. We discuss the difficulties involved in predicting good combinations of evidence for ad-hoc retrieval, and suggest the factors that may lead to the success or failure of combination. We also demonstrate how, in a relevance feedback situation, the relevance assessments can provide a good indication of how evidence should be selected for query term weighting. The use of relevance information to guide the combination process is shown to reduce the variability inherent in the combination of evidence.
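
    As a rough illustration of combining evidence from multiple weighting functions, the sketch below mixes a within-document term-frequency weight with a collection-level idf weight through a linear combination. The functions and the mixing parameter alpha are illustrative stand-ins, not the paper's actual weighting schemes.

```python
import math

def tf_weight(tf: int) -> float:
    """Within-document evidence: dampened term frequency."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def idf_weight(df: int, n_docs: int) -> float:
    """Collection-level evidence: inverse document frequency."""
    return math.log((n_docs + 1) / (df + 1))

def combined_score(query_terms, doc_tf, df, n_docs, alpha=0.5):
    """Linearly combine the two sources of evidence per query term."""
    score = 0.0
    for term in query_terms:
        w_doc = tf_weight(doc_tf.get(term, 0))
        w_coll = idf_weight(df.get(term, 0), n_docs)
        score += alpha * w_doc + (1.0 - alpha) * w_doc * w_coll
    return score
```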

    Web news classification using neural networks based on PCA

    In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network whose inputs are obtained from both the principal components and class profile-based features (CPBF). A fixed number of regular words from each class, combined with the reduced features from the PCA, is used to form the feature vectors. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy with the sports news datasets.
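
    A minimal sketch of the WPCM pipeline as the abstract describes it: reduce the document-term vectors with PCA, append the class profile-based features, and feed the result to a neural network. Library choices, dimensions and the hypothetical loader are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def build_inputs(tf_vectors, cpbf_features, n_components=50):
    """Concatenate PCA-reduced term features with class-profile features."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(tf_vectors)          # principal components
    return np.hstack([reduced, cpbf_features]), pca  # combined input vectors

# X: document-term matrix, C: per-document class-profile features, y: labels
# X, C, y = load_news_corpus()                       # hypothetical loader
# inputs, pca = build_inputs(X, C)
# clf = MLPClassifier(hidden_layer_sizes=(64,)).fit(inputs, y)
```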

    Effect of Tunable Indexing on Term Distribution and Cluster-based Information Retrieval Performance

    The purpose of this study is to investigate the effect of tunable indexing on the structure and information retrieval performance of a clustered document database. The generation of all cluster structures and calculation of term discrimination values is based upon the Cover Coefficient-Based Clustering Methodology. Information retrieval performance is measured in terms of precision, recall, and e-measure. The relationship between term generality and term discrimination value is quantified using the Pearson Rank Correlation Coefficient Test. The effect of tunable indexing on index term distribution and on the number of target clusters is examined.
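
    For readers unfamiliar with term discrimination values, the sketch below shows the classic density-based formulation: a term's value is the change in average pairwise document similarity when the term is removed. Note the study itself derives discrimination values via the Cover Coefficient-Based Clustering Methodology; this simpler form is only conceptual background.

```python
import numpy as np

def density(doc_matrix: np.ndarray) -> float:
    """Average pairwise cosine similarity over all document pairs."""
    norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    unit = doc_matrix / np.where(norms == 0, 1, norms)
    sims = unit @ unit.T
    n = len(doc_matrix)
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))  # exclude self-pairs

def term_discrimination_value(doc_matrix: np.ndarray, term: int) -> float:
    """Positive TDV: the term spreads documents apart (a good discriminator)."""
    without = np.delete(doc_matrix, term, axis=1)
    return density(without) - density(doc_matrix)
```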

    Learning Syntactic Rules and Tags with Genetic Algorithms for Information Retrieval and Filtering: An Empirical Basis for Grammatical Rules

    The grammars of natural languages may be learned by using genetic algorithms that reproduce and mutate grammatical rules and part-of-speech tags, improving the quality of later generations of grammatical components. Syntactic rules are randomly generated and then evolve; those rules resulting in improved parsing and occasionally improved retrieval and filtering performance are allowed to further propagate. The LUST system learns the characteristics of the language or sublanguage used in document abstracts by learning from the document rankings obtained from the parsed abstracts. Unlike the application of traditional linguistic rules to retrieval and filtering applications, LUST develops grammatical structures and tags without the prior imposition of some common grammatical assumptions (e.g., part-of-speech assumptions), producing grammars that are empirically based and are optimized for this particular application. (Accepted for publication in Information Processing and Management.)
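
    A minimal sketch of the evolutionary loop described above, assuming a generic encoding of syntactic rules: random rule sets are scored by a fitness function standing in for parsing/retrieval quality, the fittest half survives, and offspring are produced by mutation. Population size, mutation rate and rule encoding are illustrative assumptions, not LUST's actual parameters.

```python
import random

def evolve(random_rule, mutate, fitness, pop_size=50, generations=100):
    """Evolve rule sets: selection of the fittest half, then mutation."""
    population = [[random_rule() for _ in range(10)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: pop_size // 2]            # selection
        children = []
        for parent in survivors:
            child = [mutate(r) if random.random() < 0.1 else r for r in parent]
            children.append(child)                     # mutated offspring
        population = survivors + children
    return max(population, key=fitness)
```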

    Retrieval of Spoken Documents: First Experiences (Research Report TR-1997-34)

    We report on our first experiences in dealing with the retrieval of spoken documents. While lacking the tools and know-how for performing speech recognition on the spoken documents, we tried to use our knowledge of probabilistic indexing and retrieval of textual documents in the best possible way. The techniques we used and the results we obtained are encouraging, motivating further experimentation in this new area of research.

    Information Retrieval Performance Enhancement Using The Average Standard Estimator And The Multi-criteria Decision Weighted Set

    Information retrieval over large document collections is much more challenging than traditional retrieval over small collections. The main difference is the importance of correlations between related concepts in complex data structures. These structures have been studied by several information retrieval systems. This research began by performing a comprehensive review and comparison of several techniques of matrix dimensionality estimation and their respective effects on enhancing retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques are introduced in this research to enhance intrinsic dimensionality estimation: the Multi-criteria Decision Weighted model, which estimates matrix intrinsic dimensionality for large document collections, and the Average Standard Estimator (ASE), which estimates data intrinsic dimensionality based on the singular value decomposition (SVD). ASE estimates the level of significance of the singular values resulting from the singular value decomposition. ASE assumes that variables with deep relations have sufficient correlation and that only relationships with high singular values are significant and should be maintained. Experimental results over all possible dimensions indicated that ASE improved matrix intrinsic dimensionality estimation by including the effects of both the magnitude of decrease of the singular values and random noise distractors. Analysis based on selected performance measures indicates that for each document collection there is a region of lower dimensionalities associated with improved retrieval performance. However, there was clear disagreement between the various performance measures on the model associated with the best performance. The introduction of the multi-weighted model and Analytical Hierarchy Processing (AHP) analysis helped in ranking the dimensionality estimation techniques and facilitated satisfying the overall model goals by balancing contradicting constraints and satisfying information retrieval priorities. ASE provided the best estimate of MEDLINE intrinsic dimensionality among all the dimensionality estimation techniques tested; further, ASE improved precision and relative relevance by 10.2% and 7.4% respectively. AHP analysis indicates that ASE and the weighted model ranked best among the other methods, satisfying 30.3% and 20.3% of the overall model goals for MEDLINE and 22.6% and 25.1% for CRANFIELD. The weighted model improved MEDLINE relative relevance by 4.4%, while the scree plot, the weighted model, and ASE provided better estimates of data intrinsic dimensionality for the CRANFIELD collection than Kaiser-Guttman and Percentage of Variance. The ASE dimensionality estimation technique provided a better estimate of CISI intrinsic dimensionality than all the other tested methods, since all methods except ASE tend to underestimate the intrinsic dimensionality of the CISI document collection. ASE improved CISI average relative relevance and average search length by 28.4% and 22.0% respectively. This research provided evidence that a system using a weighted multi-criteria performance evaluation technique results in better overall performance than a single-criterion ranking model. Thus, the weighted multi-criteria model with dimensionality reduction provides a more efficient implementation for information retrieval than a full-rank model.
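
    The abstract does not give ASE's exact formula, so the sketch below only shows where such an estimator plugs into an SVD/LSA pipeline; the selection rule used here (keep singular values above their mean) is a hypothetical stand-in, not the actual ASE criterion.

```python
import numpy as np

def estimate_rank(singular_values: np.ndarray) -> int:
    """Hypothetical rule: treat singular values above the mean as significant."""
    return int(np.sum(singular_values > singular_values.mean()))

def reduced_representation(term_doc: np.ndarray):
    """Truncate the SVD at the estimated intrinsic dimensionality."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    k = estimate_rank(s)                 # intrinsic dimensionality estimate
    return u[:, :k], s[:k], vt[:k, :]    # rank-k latent semantic space
```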

    A stemming algorithm for Latvian

    The thesis covers the construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions: Is it possible to apply to Latvian a suffix removal algorithm originally designed for English? Can stemming in Latvian produce the same or better information retrieval results than manual truncation? In order to achieve these aims, the role and importance of automatic word conflation for both document indexing and information retrieval are characterised. A review of the literature, which analyzes and evaluates different types of stemming techniques and the retrospective development of stemming algorithms, justifies the necessity of applying this advanced IR method to Latvian as well. A comparative analysis of the morphological structures of English and Latvian determined the selection of Porter's suffix removal algorithm as the basis for the Latvian stemmer. An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of specific modifications and changes related to the Latvian language were made to the structure and rules of the original stemming algorithm. Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can be successfully applied to Latvian as well. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
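
    A minimal sketch of a Porter-style suffix-removal stemmer along the lines described: stopwords are eliminated first, then the longest matching suffix is stripped subject to a minimum stem length. The suffix and stopword lists are short illustrative samples, not the thesis's actual rule set.

```python
# Longest-match-first suffix list (illustrative Latvian endings only).
SUFFIXES = sorted(["iem", "ām", "as", "es", "is", "us", "os",
                   "a", "e", "i", "u", "s", "š"], key=len, reverse=True)
STOPWORDS = {"un", "bet", "vai", "arī"}    # sample conjunctions/particles

def stem(word: str, min_stem: int = 3) -> str:
    """Strip the longest suffix that leaves a stem of acceptable length."""
    if word in STOPWORDS:
        return ""                          # eliminate insignificant words
    for suffix in SUFFIXES:                # longest match first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```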

    Providing personalised information based on individual interests and preferences.

    The main aim of personalised Information Retrieval (IR) is to provide an effective IR system whereby relevant information can be presented according to individual users' interests and preferences. In response to their queries, all Web users expect to obtain the search results in a rank order with the most relevant items at the lowest ranks. Effective IR systems rank the less relevant documents below the relevant documents. However, a commonly stated problem of Web browsers is matching the users' queries to the information base. The key challenge is to return a list of search results containing a low level of non-relevant documents while not missing out the relevant documents. To address this problem, keyword-based search using the Vector Space Model is employed as an IR technique to model the Web users and build their interest profiles. Semantic-based search through an ontology is further employed to represent documents matching the users' needs without being directly contained in the users' specified keywords. The users' log files are one of the most important sources from which implicit feedback is detected through their profiles. These provide valuable information based on which alternative learning approaches (i.e. dwell-based search) can be incorporated into the standard IR measures (i.e. tf-idf), allowing a further improvement of the personalisation of Web document search and thus increasing the performance of IR systems. To incorporate such a non-textual data type (i.e. dwell) into the hybridisation of the keyword-based and semantic-based searches entails a complex interaction of information attributes in the index structure. A dwell-based filter called dwell-tf-idf, which allows a standard tokeniser to be converted into a keyword tokeniser, is thus proposed. The proposed filter uses an efficient hybrid indexing technique to bring textual and non-textual data types under one umbrella, thus moving beyond simple keyword matching to improve future retrieval applications for web browsers. Adopting precision and recall, the most common evaluation measures, the superiority of the hybridisation of these approaches lies in pushing significantly relevant documents to the top of the ranked lists, as compared to any traditional search system. The results were empirically confirmed through human subjects who conducted several real-life Web searches.
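
    The abstract does not spell out the dwell-based weighting formula, so the sketch below illustrates one plausible reading: a standard tf-idf weight boosted by a multiplier derived from observed dwell time. The boost function, its range, and the dwell cap are assumptions for illustration only.

```python
import math

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    """Standard tf-idf term weight."""
    return (1 + math.log(tf)) * math.log(n_docs / (1 + df)) if tf > 0 else 0.0

def dwell_weight(dwell_seconds: float, max_dwell: float = 300.0) -> float:
    """Map observed dwell time onto a [1, 2] multiplier (assumed range)."""
    return 1.0 + min(dwell_seconds, max_dwell) / max_dwell

def dwell_boosted_weight(tf, df, n_docs, dwell_seconds):
    """tf-idf evidence scaled by implicit dwell-time feedback."""
    return tf_idf(tf, df, n_docs) * dwell_weight(dwell_seconds)
```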

    Abduction, Explanation and Relevance Feedback

    Selecting good query terms to represent an information need is difficult. The complexity of verbalising an information need can increase when the need is vague, when the document collection is unfamiliar or when the searcher is inexperienced with information retrieval (IR) systems. It is much easier, however, for a user to assess which documents contain relevant information. Relevance feedback (RF) techniques make use of this fact to automatically modify a query representation based on the documents a user considers relevant. RF has proved to be relatively successful at increasing the effectiveness of retrieval systems in certain types of search, and RF techniques have gradually appeared in operational systems and even some Web search engines. However, the traditional approaches to RF do not consider the behavioural aspects of information seeking. The standard RF algorithms consider only what documents the user has marked as relevant; they do not consider how the user has assessed relevance. For RF to become an effective support to information seeking it is imperative to develop new models of RF that are capable of incorporating how users make relevance assessments. In this thesis I view RF as a process of explanation. A RF theory should provide an explanation of why a document is relevant to an information need. Such an explanation can be based on how information is used within documents. I use abductive inference to provide a framework for an explanation-based account of RF. Abductive inference is specifically designed as a technique for generating explanations of complex events, and has been widely used in a range of diagnostic systems. Such a framework is capable of producing a set of possible explanations for why a user marked a number of documents relevant at the current search iteration. The choice of which explanation to use is guided by information on how the user has interacted with the system: how many documents they have marked relevant, where in the document ranking the relevant documents occur, and the relevance score given to a document by the user. This behavioural information is used to create explanations and to choose which type of explanation is required in the search. The explanation is then used as the basis of a modified query to be submitted to the system. I also investigate how the notion of explanation can be used at the interface to encourage more use of RF by searchers.
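
    The standard RF baseline criticised above considers only which documents were marked relevant. The classic Rocchio update is that baseline: it moves the query vector toward the centroid of the relevant documents and away from the non-relevant ones, ignoring how the assessments were made. Parameter values below are conventional defaults, not taken from the thesis.

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """query: 1-D term vector; relevant/nonrelevant: lists of doc vectors."""
    q = alpha * query
    if relevant:
        q = q + beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.clip(q, 0, None)   # negative term weights are usually dropped
```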