Combining and selecting characteristics of information use
In this paper we report on a series of experiments designed to investigate the combination of term and document weighting functions in Information Retrieval. We describe a series of weighting functions, each of which is based on how information is used within documents and collections, and use these weighting functions in two types of experiments: one based on combination of evidence for ad-hoc retrieval, the other based on selective combination of evidence within a relevance feedback situation. We discuss the difficulties involved in predicting good combinations of evidence for ad-hoc retrieval, and suggest the factors that may lead to the success or failure of combination. We also demonstrate how, in a relevance feedback situation, the relevance assessments can provide a good indication of how evidence should be selected for query term weighting. The use of relevance information to guide the combination process is shown to reduce the variability inherent in combination of evidence.
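The combination of evidence the abstract describes can be sketched as a weighted linear mix of two term-weighting functions. The specific functions and mixing weights below are illustrative stand-ins, not the paper's actual weighting schemes:

```python
import math

def tf_idf_weight(tf, df, n_docs):
    """Classic tf-idf term weight (one of many possible weighting functions)."""
    return tf * math.log(n_docs / df)

def tf_only_weight(tf, df, n_docs):
    """A simpler within-document weight that ignores collection statistics."""
    return (1.0 + math.log(tf)) if tf > 0 else 0.0

def combined_score(term_stats, weights=(0.5, 0.5), n_docs=1000):
    """Linearly combine the evidence from two weighting functions.

    term_stats: list of (tf, df) pairs for the query terms in one document.
    weights: mixing coefficients; in a relevance-feedback setting these
    could be tuned from relevance assessments rather than fixed a priori.
    """
    a, b = weights
    return sum(a * tf_idf_weight(tf, df, n_docs) +
               b * tf_only_weight(tf, df, n_docs)
               for tf, df in term_stats)

score = combined_score([(3, 10), (1, 200)])
```

Setting one weight to zero recovers a single weighting function, which is how a pure baseline would be run against the combination.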
Web news classification using neural networks based on PCA
In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features (CPBF). A fixed number of regular words from each class is used, together with the reduced features from the PCA, to form the feature vectors. These feature vectors are then used as the input to the neural network for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy with the sports news datasets.
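The feature pipeline can be sketched with NumPy: PCA reduces the document-term matrix, class-profile features are appended, and the concatenation forms the network input. The toy data, class word lists, and dimensions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy document-term matrix (rows: documents, columns: term frequencies).
X = rng.random((20, 50))

# --- PCA via SVD: project documents onto the top-k principal components ---
k = 5
Xc = X - X.mean(axis=0)                   # centre the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_features = Xc @ Vt[:k].T              # shape (20, k)

# --- Class-profile-based features (CPBF): here, simply the summed ---
# frequency of a fixed set of "regular words" per class (hypothetical lists).
class_word_idx = {"sports": [0, 1, 2], "politics": [3, 4, 5]}
cpbf = np.stack([X[:, idx].sum(axis=1) for idx in class_word_idx.values()],
                axis=1)                   # shape (20, n_classes)

# Concatenate both feature groups to form the neural-network input vectors.
nn_input = np.hstack([pca_features, cpbf])
```

The neural network itself then consumes `nn_input` row by row; any standard feed-forward classifier would do at that point.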
Effect of Tunable Indexing on Term Distribution and Cluster-based Information Retrieval Performance
The purpose of this study is to investigate the effect of tunable indexing on the structure and information retrieval performance of a clustered document database. The generation of all cluster structures and the calculation of term discrimination values are based upon the Cover-Coefficient-Based Clustering Methodology. Information retrieval performance is measured in terms of precision, recall, and E-measure. The relationship between term generality and term discrimination value is quantified using the Pearson Rank Correlation Coefficient Test. The effect of tunable indexing on index term distribution and on the number of target clusters is examined.
Learning Syntactic Rules and Tags with Genetic Algorithms for Information Retrieval and Filtering: An Empirical Basis for Grammatical Rules
The grammars of natural languages may be learned by using genetic algorithms that reproduce and mutate grammatical rules and part-of-speech tags, improving the quality of later generations of grammatical components. Syntactic rules are randomly generated and then evolve; those rules resulting in improved parsing and occasionally improved retrieval and filtering performance are allowed to propagate further. The LUST system learns the characteristics of the language or sublanguage used in document abstracts by learning from the document rankings obtained from the parsed abstracts. Unlike the application of traditional linguistic rules to retrieval and filtering applications, LUST develops grammatical structures and tags without the prior imposition of common grammatical assumptions (e.g., part-of-speech assumptions), producing grammars that are empirically based and optimized for this particular application.
Comment: LaTeX document, postscript figures not included. Accepted for publication in Information Processing and Management.
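The evolutionary loop can be caricatured in a few lines: candidate rule sets are scored by a fitness function, the fitter half survives, and mutated copies refill the population. The toy corpus and suffix-matching fitness below are invented for illustration; in LUST, fitness derives from parsing quality and the document rankings of the parsed abstracts.

```python
import random

random.seed(0)

# Hypothetical fitness: how many corpus words a candidate "rule set"
# (here just a list of suffix strings) matches.
def fitness(rules, corpus):
    return sum(any(word.endswith(r) for r in rules)
               for sentence in corpus for word in sentence.split())

def mutate(rules, alphabet="abcdenorst"):
    """Replace one randomly chosen rule with a random single letter."""
    child = list(rules)
    child[random.randrange(len(child))] = random.choice(alphabet)
    return child

corpus = ["cats sit on mats", "dogs run fast"]
population = [[random.choice("abcdenorst") for _ in range(3)]
              for _ in range(8)]

for generation in range(25):
    # Select the fitter half, then refill with mutated copies of it.
    population.sort(key=lambda r: fitness(r, corpus), reverse=True)
    population = population[:4] + [mutate(r) for r in population[:4]]

best = population[0]
```

The selection-plus-mutation cycle is the essential mechanism; LUST's distinguishing feature is that the fitness signal comes from retrieval behaviour rather than from hand-written grammatical assumptions.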
Towards Nootropia: a non-linear approach to adaptive document filtering
In recent years, it has become increasingly difficult for users to find relevant information within the accessible glut. Research in Information Filtering (IF) tackles this problem through a tailored representation of the user's interests, a user profile. Traditionally, IF inherits techniques from the related and better-established domains of Information Retrieval and Text Categorisation. These include linear profile representations that exclude term dependencies and may only effectively represent a single topic of interest, and linear learning algorithms that achieve a steady profile adaptation pace. We argue that these practices are not attuned to the dynamic nature of user interests. A user may be interested in more than one topic in parallel, and both frequent variations and occasional radical changes of interests are inevitable over time. With our experimental system "Nootropia", we achieve adaptive document filtering with a single, multi-topic user profile. A hierarchical term network that takes into account topical and lexical correlations between terms and identifies topic-subtopic relations between them is used to represent a user's multiple topics of interest and distinguish between them. A series of non-linear document evaluation functions is then established on the hierarchical network. Experiments using a variation of TREC's routing subtask to test the ability of a single profile to represent two and three topics of interest reveal the approach's superiority over a linear profile representation. Adaptation of this single, multi-topic profile to a variety of changes in the user interests is achieved through a process of self-organisation that constantly readjusts the profile structurally in response to user feedback. We used virtual users and another variation of TREC's routing subtask to test the profile on two learning and two forgetting tasks.
The results clearly indicate the profile's ability to adapt to both frequent variations and radical changes in user interests.
Retrieval of Spoken Documents: First Experiences (Research Report TR-1997-34)
We report on our first experiences in dealing with the retrieval of spoken documents. While lacking the tools and know-how for performing speech recognition on the spoken documents, we tried to use our knowledge of probabilistic indexing and retrieval of textual documents in the best possible way. The techniques we used and the results we obtained are encouraging, motivating our future involvement in further experimentation in this new area of research.
Information Retrieval Performance Enhancement Using The Average Standard Estimator And The Multi-criteria Decision Weighted Set
Information retrieval is much more challenging than traditional small-document-collection retrieval. The main difference is the importance of correlations between related concepts in complex data structures. These structures have been studied by several information retrieval systems. This research began by performing a comprehensive review and comparison of several techniques of matrix dimensionality estimation and their respective effects on enhancing retrieval performance using singular value decomposition and latent semantic analysis. Two novel techniques are introduced in this research to enhance intrinsic dimensionality estimation: the Multi-criteria Decision Weighted model, to estimate matrix intrinsic dimensionality for large document collections, and the Average Standard Estimator (ASE), for estimating data intrinsic dimensionality based on the singular value decomposition (SVD). ASE estimates the level of significance of the singular values resulting from the singular value decomposition. ASE assumes that those variables with deep relations have sufficient correlation and that only those relationships with high singular values are significant and should be maintained. Experimental results over all possible dimensions indicated that ASE improved matrix intrinsic dimensionality estimation by accounting for both the magnitude of decrease of the singular values and random noise distracters. Analysis based on selected performance measures indicates that for each document collection there is a region of lower dimensionalities associated with improved retrieval performance. However, there was clear disagreement between the various performance measures on the model associated with best performance.
The introduction of the multi-weighted model and Analytical Hierarchy Processing (AHP) analysis helped in ranking dimensionality estimation techniques and facilitated satisfying overall model goals by leveraging contradicting constraints and satisfying information retrieval priorities. ASE provided the best estimate for MEDLINE intrinsic dimensionality among all other dimensionality estimation techniques; further, ASE improved precision and relative relevance by 10.2% and 7.4% respectively. AHP analysis indicates that ASE and the weighted model ranked best among the other methods, with 30.3% and 20.3% in satisfying overall model goals for MEDLINE and 22.6% and 25.1% for CRANFIELD. The weighted model improved MEDLINE relative relevance by 4.4%, while the scree plot, weighted model, and ASE provided better estimation of data intrinsic dimensionality for the CRANFIELD collection than Kaiser-Guttman and percentage of variance. The ASE dimensionality estimation technique provided a better estimation of CISI intrinsic dimensionality than all other tested methods, since all methods except ASE tend to underestimate the CISI document collection's intrinsic dimensionality. ASE improved CISI average relative relevance and average search length by 28.4% and 22.0% respectively. This research provided evidence that a system using a weighted multi-criteria performance evaluation technique results in better overall performance than a single-criteria ranking model. Thus, the weighted multi-criteria model with dimensionality reduction provides a more efficient implementation for information retrieval than using a full-rank model.
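The general shape of SVD-based intrinsic-dimensionality estimation can be sketched as follows. The abstract does not give ASE's scoring formula, so the threshold rule here (keep singular values above their mean, in the spirit of Kaiser-Guttman applied to singular values) is a stand-in, not the actual ASE definition:

```python
import numpy as np

def estimate_intrinsic_dim(A):
    """Hedged sketch of SVD-based dimensionality estimation.

    Stand-in rule: count singular values above their mean.  ASE instead
    scores significance from the magnitude-of-decrease pattern and noise.
    """
    s = np.linalg.svd(A, compute_uv=False)
    return int((s > s.mean()).sum())

# An exactly rank-1 "document collection": one dominant singular value.
A = np.outer(np.arange(1, 7), np.arange(1, 5)).astype(float)
k = estimate_intrinsic_dim(A)
```

Retrieval would then be run in the k-dimensional LSA space (the truncated SVD), which is where the reported performance differences between estimators arise.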
A stemming algorithm for Latvian
The thesis covers the construction, application and evaluation of a stemming algorithm for advanced information searching and retrieval in Latvian databases. Its aim is to examine the following two questions:
Is it possible to apply for Latvian a suffix removal algorithm originally designed for English?
Can stemming in Latvian produce the same or better information retrieval results than manual truncation?
In order to achieve these aims, the role and importance of automatic word conflation both for document indexing and information retrieval are characterised. A review of the literature, which analyses and evaluates different types of stemming techniques and the retrospective development of stemming algorithms, justifies the necessity of applying this advanced IR method to Latvian as well. Comparative analysis of the morphological structure of both English and Latvian determined the selection of Porter's suffix removal algorithm as a basis for the Latvian stemmer.
An extensive list of Latvian stopwords, including conjunctions, particles and adverbs, was designed and added to the initial stemmer in order to eliminate insignificant words from further processing. A number of specific modifications and changes related to the Latvian language were carried out to the structure and rules of the original stemming algorithm.
Analysis of word stemming based on a Latvian electronic dictionary and Latvian text fragments confirmed that the suffix removal technique can also be successfully applied to Latvian. An evaluation study of user search statements revealed that the stemming algorithm can, to a certain extent, improve the effectiveness of information retrieval.
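The Porter-style suffix removal the thesis adapts can be sketched as below. The suffix and stopword lists are illustrative only, not the thesis's actual Latvian rule set:

```python
# Illustrative (not actual) Latvian inflectional endings and stopwords.
LATVIAN_SUFFIXES = ["iem", "ām", "as", "es", "is", "us", "a", "e", "i", "u", "s"]
STOPWORDS = {"un", "ir", "ar", "bet"}

def stem(word, min_stem_len=3):
    """Strip the longest matching suffix while keeping a minimum stem length.

    Stopwords return None, mirroring their elimination before indexing.
    """
    if word in STOPWORDS:
        return None
    for suf in sorted(LATVIAN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[: len(word) - len(suf)]
    return word
```

The minimum-stem-length guard is the analogue of Porter's measure conditions: it stops short words from being truncated into meaningless stems.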
Providing personalised information based on individual interests and preferences.
The main aim of personalised Information Retrieval (IR) is to provide an effective IR system whereby relevant information can be presented according to individual users' interests and preferences. In response to their queries, all Web users expect to obtain the search results in rank order, with the most relevant items at the lowest ranks. Effective IR systems rank the less relevant documents below the relevant documents. However, a commonly stated problem for Web browsers is matching the users' queries to the information base. The key challenge is to return a list of search results containing a low level of non-relevant documents while not missing out the relevant documents. To address this problem, keyword-based search with the Vector Space Model is employed as an IR technique to model the Web users and build their interest profiles. Semantic-based search through an ontology is further employed to represent documents matching the users' needs without being directly contained in the users' specified keywords. The users' log files are one of the most important sources from which implicit feedback is detected through their profiles. These provide valuable information based on which alternative learning approaches (i.e. dwell-based search) can be incorporated into the standard IR measures (i.e. tf-idf), allowing a further improvement of the personalisation of Web document search and thus increasing the performance of IR systems. Incorporating such a non-textual data type (i.e. dwell) into the hybridisation of the keyword-based and semantic-based searches entails a complex interaction of information attributes in the index structure. A dwell-based filter called dwell-tf-idf that allows a standard tokeniser to be converted into a keyword tokeniser is thus proposed.
The proposed filter uses an efficient hybrid indexing technique to bring textual and non-textual data types under one umbrella, thus moving beyond simple keyword matching to improve future retrieval applications for web browsers. Adopting precision and recall, the most common evaluation measures, the superiority of the hybridisation of these approaches lies in pushing significantly relevant documents to the top of the ranked lists, as compared to a traditional search system. The results were empirically confirmed through human subjects who conducted several real-life Web searches.
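One plausible way to fold dwell time into term weighting is a multiplicative boost on tf-idf. The functional form and the normalisation constant below are assumptions for illustration, not the thesis's actual dwell-tf-idf definition:

```python
import math

def dwell_tf_idf(tf, df, n_docs, dwell_seconds, max_dwell=300.0):
    """Hedged sketch: scale a document's tf-idf term weight by dwell time.

    dwell_seconds comes from the user's log files (implicit feedback);
    max_dwell caps the boost so one long visit cannot dominate.
    """
    tfidf = tf * math.log(n_docs / df)
    dwell_boost = 1.0 + min(dwell_seconds, max_dwell) / max_dwell
    return tfidf * dwell_boost
```

With zero dwell the weight reduces to plain tf-idf, and at the cap it doubles, so previously visited documents are nudged up the ranking rather than overwhelming it.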
Abduction, Explanation and Relevance Feedback
Selecting good query terms to represent an information need is difficult. The complexity of verbalising an information need can increase when the need is vague, when the document collection is unfamiliar or when the searcher is inexperienced with information retrieval (IR) systems. It is much easier, however, for a user to assess which documents contain relevant information. Relevance feedback (RF) techniques make use of this fact to automatically modify a query representation based on the documents a user considers relevant. RF has proved to be relatively successful at increasing the effectiveness of retrieval systems in certain types of search, and RF techniques have gradually appeared in operational systems and even some Web engines. However, the traditional approaches to RF do not consider the behavioural aspects of information seeking. The standard RF algorithms consider only what documents the user has marked as relevant; they do not consider how the user has assessed relevance. For RF to become an effective support to information seeking it is imperative to develop new models of RF that are capable of incorporating how users make relevance assessments. In this thesis I view RF as a process of explanation. A RF theory should provide an explanation of why a document is relevant to an information need. Such an explanation can be based on how information is used within documents. I use abductive inference to provide a framework for an explanation-based account of RF. Abductive inference is specifically designed as a technique for generating explanations of complex events, and has been widely used in a range of diagnostic systems. Such a framework is capable of producing a set of possible explanations for why a user marked a number of documents relevant at the current search iteration. 
The choice of which explanation to use is guided by information on how the user has interacted with the system: how many documents they have marked relevant, where in the document ranking the relevant documents occur, and the relevance score given to a document by the user. This behavioural information is used to create explanations and to choose which type of explanation is required in the search. The explanation is then used as the basis of a modified query to be submitted to the system. I also investigate how the notion of explanation can be used at the interface to encourage more use of RF by searchers.
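For contrast with the explanation-based approach, a standard RF algorithm of the kind the thesis critiques, Rocchio query modification, considers only which documents were marked relevant, not how they were assessed:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio query modification over term-weight vectors.

    Moves the query towards the centroid of relevant documents and away
    from the centroid of non-relevant ones; note it uses only *which*
    documents were judged, ignoring rank position or assessment behaviour.
    """
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return np.maximum(q, 0.0)            # negative weights are usually clipped

modified = rocchio(np.array([1.0, 0.0, 0.0]),
                   relevant=np.array([[0.0, 1.0, 0.0]]),
                   nonrelevant=np.array([[0.0, 0.0, 1.0]]))
```

An explanation-based variant would additionally condition the update on behavioural signals such as where in the ranking the judged documents appeared.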