Improving document representation by accumulating relevance feedback : the relevance feedback accumulation (RFA) algorithm
Document representation (indexing) techniques are dominated by variants of the term-frequency analysis approach, based on the assumption that the more occurrences a term has throughout a document, the more important the term is in that document. Inherent drawbacks associated with this approach include poor index quality, large document representation size, and the word mismatch problem. To tackle these drawbacks, a document representation improvement method called the Relevance Feedback Accumulation (RFA) algorithm is presented. The algorithm provides a mechanism to continuously accumulate relevance assessments over time and across users. It also provides a document representation modification function, or document representation learning function, that gradually improves the quality of the document representations. To do so, the learning function uses a data mining measure called support to analyze the accumulated relevance feedback.
Evaluation is done by comparing the RFA algorithm to four other algorithms. The four measures used for evaluation are (a) average number of index terms per document; (b) the quality of the document representations assessed by human judges; (c) retrieval effectiveness; and (d) the quality of the document representation learning function. The evaluation results show that (1) the algorithm is able to substantially reduce document representation size while maintaining retrieval effectiveness; (2) the algorithm provides a smooth and steady document representation learning function; and (3) the algorithm improves the quality of the document representations. The RFA algorithm's approach is consistent with efficiency considerations that hold in real information retrieval systems.
The major contribution made by this research is the design and implementation of a novel, simple, efficient, and scalable technique for document representation improvement.
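The abstract does not spell out the learning function, so the following is only a minimal sketch of how a support measure could be computed over accumulated relevance feedback. It assumes that feedback is stored as the sets of query terms for which a document was judged relevant, and that a term's support is the fraction of those queries containing it. The class name `FeedbackAccumulator`, its methods, and the `min_support` threshold are hypothetical and are not taken from the RFA work.

```python
from collections import Counter, defaultdict


class FeedbackAccumulator:
    """Hypothetical store of relevance feedback, keyed by document id."""

    def __init__(self):
        # doc_id -> list of query term sets judged relevant to that document
        self._feedback = defaultdict(list)

    def record(self, doc_id, query_terms):
        """Record one positive relevance judgment for doc_id."""
        self._feedback[doc_id].append(frozenset(t.lower() for t in query_terms))

    def support(self, doc_id):
        """Return {term: fraction of recorded queries that contain the term}."""
        queries = self._feedback.get(doc_id, [])
        if not queries:
            return {}
        counts = Counter(term for q in queries for term in q)
        return {term: n / len(queries) for term, n in counts.items()}

    def representation(self, doc_id, min_support=0.5):
        """Keep only terms whose support meets the assumed threshold."""
        return sorted(t for t, s in self.support(doc_id).items() if s >= min_support)


acc = FeedbackAccumulator()
acc.record("d1", ["relevance", "feedback"])
acc.record("d1", ["feedback", "indexing"])
acc.record("d1", ["document", "feedback"])
acc.record("d1", ["relevance", "indexing"])
print(acc.representation("d1"))  # ['feedback', 'indexing', 'relevance']
```

In this sketch the representation can only change as new judgments arrive, which mirrors the gradual, accumulate-over-time behaviour the abstract describes; the actual update rule used by RFA may differ.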
Generating Better Concept Hierarchies Using Automatic Document Classification
This paper presents a hybrid concept hierarchy development technique for web documents retrieved by a meta-search engine. The aim of the technique is to separate the initially retrieved documents into topically oriented categories prior to the actual concept hierarchy generation. The topical categories correspond to different semantic aspects of the query. This is done by applying 1-of-n automatic document classification to the initial set of returned documents. Then, an individual topical concept hierarchy is automatically generated inside each of the resulting categories. Both steps are executed on the fly at retrieval time. Due to the efficiency constraints imposed by the web retrieval context, the algorithm uses only document snippets (rather than full web pages) for both document classification and concept hierarchy generation. Experimental results show that the algorithm is able to improve the quality of the concept hierarchy presented to the searcher; at the same time, the efficiency parameters are kept within reasonable intervals.
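The first step described above, assigning each snippet to exactly one topical category before any hierarchy is built, can be illustrated with a rough sketch. The keyword-overlap classifier, the category definitions, and the function names below are assumptions made for illustration only; the paper's actual classifier is not described in the abstract, and hierarchy generation inside each category is left out.

```python
from collections import defaultdict


def classify_snippet(snippet, category_keywords, fallback="other"):
    """1-of-n assignment: pick the single category with the largest keyword overlap."""
    tokens = set(snippet.lower().split())
    best, best_overlap = fallback, 0
    for category, keywords in category_keywords.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best


def group_by_category(snippets, category_keywords):
    """Partition snippets so each one lands in exactly one category."""
    groups = defaultdict(list)
    for s in snippets:
        groups[classify_snippet(s, category_keywords)].append(s)
    return dict(groups)


# Hypothetical categories for the polysemous query "jaguar".
categories = {
    "animal": {"cat", "wildlife", "habitat", "species"},
    "automobile": {"car", "engine", "dealer", "model"},
}
snippets = [
    "jaguar species habitat in south america",
    "jaguar car dealer near you",
    "jaguar official site",
]
groups = group_by_category(snippets, categories)
for name, members in groups.items():
    print(name, members)
```

A separate concept hierarchy would then be generated from the snippets in each group; working on snippets alone keeps both steps cheap enough to run at retrieval time.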
Improving Document Representations Using Relevance Feedback: The RFA Algorithm
In this paper we present a document representation improvement technique, named the Relevance Feedback Accumulation (RFA) algorithm. Using prior relevance feedback assessments and a data mining measure called “support”, the algorithm’s learning function gradually improves document representations, over time and across users. Results show that the modified document representations yield lower dimensionality while improving retrieval effectiveness. The algorithm is efficient and scalable, suited for retrieval systems managing large document collections.
Automatically Finding Significant Topical Terms from Documents (Li et al.)
With the pervasiveness of digital textual data, text mining is becoming more and more important for deriving competitive advantages. One factor in successful text mining applications is the ability to find significant topical terms for discovering interesting patterns or relationships. Document keyphrases are phrases carrying the most important topical concepts for a given document. In many applications, keyphrases as textual elements are better suited for text mining and could provide more discriminating power than single words. This paper describes an automatic keyphrase identification program (KIP). KIP’s algorithm examines the composition of noun phrases and calculates their scores by looking up a domain-specific glossary database; the ones with higher scores are extracted as keyphrases. KIP’s learning function can enrich its glossary database by automatically adding newly identified keyphrases. KIP’s personalization feature allows the user to build a glossary database specifically suited to his/her area of interest.
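As a rough illustration of glossary-based scoring, the sketch below scores candidate noun phrases by summing assumed glossary weights for their component words plus a weight for the whole phrase, then adds the selected phrases back into the glossary. The weighting scheme, the threshold, and the function names are hypothetical and are not KIP's actual formula; noun phrase extraction is assumed to happen upstream.

```python
def score_phrase(phrase, glossary):
    """Score a candidate noun phrase from glossary weights (assumed scheme)."""
    words = phrase.lower().split()
    # Contribution of each component word found in the glossary.
    word_score = sum(glossary.get(w, 0.0) for w in words)
    # Extra contribution when the whole multi-word phrase is a glossary entry.
    phrase_score = glossary.get(" ".join(words), 0.0) if len(words) > 1 else 0.0
    return word_score + phrase_score


def extract_keyphrases(candidates, glossary, threshold=1.0, learn=True, new_weight=1.0):
    """Return candidates scoring at least `threshold`, highest score first."""
    scored = [(p, score_phrase(p, glossary)) for p in candidates]
    selected = [p for p, s in sorted(scored, key=lambda x: -x[1]) if s >= threshold]
    if learn:
        # Learning step: add newly identified keyphrases to the glossary.
        for p in selected:
            glossary.setdefault(p.lower(), new_weight)
    return selected


# Hypothetical domain glossary and pre-extracted noun phrases.
glossary = {
    "information": 0.5,
    "retrieval": 0.6,
    "information retrieval": 1.0,
    "text": 0.4,
    "mining": 0.7,
}
candidates = ["information retrieval", "text mining", "large table"]
print(extract_keyphrases(candidates, glossary))  # ['information retrieval', 'text mining']
```

After this call the glossary also contains "text mining", so later documents mentioning that phrase would score higher. A per-user glossary passed in the same way would give the personalization behaviour the abstract mentions.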
Incorporating Document Keyphrases in Search Results (Li et al.)
Effective and efficient searching and presentation of returned results are key to a search engine. Before downloading and examining the document text, users usually first judge the relevance of a returned hit to the query by looking at the document metadata presented in the result. However, the metadata that comes with a returned hit is usually not rich enough for users to predict the content of the document. Keyphrases provide a concise summary of a document’s content, offering subject metadata that characterizes and summarizes the document. In this paper, we propose a mechanism for enriching the metadata of the returned results by incorporating automatically extracted document keyphrases in each returned hit. By looking at the keyphrases in each returned hit, the user can predict the content of the document more easily, quickly, and accurately. The experimental results show that our solution may save users up to 32% of their time and that users would like to use our proposed search interface with document keyphrases as part of the metadata of a returned hit.
A Hybrid Classifier Approach for Web Retrieved Documents Classification
The paper presents a hybrid technique for the classification of web returned hits into concept hierarchies. The technique involves a combination of manual and automatic classifiers. First, all web returned documents are assigned to human-defined categories using manual classifiers, and then automatic classifiers are used to generate a concept hierarchy for each of these categories. The results of the evaluation reveal the following: (a) for polysemous queries, our system is able to generate meaningful categories corresponding to (but not limited to) the different semantic facets of the queries; (b) as expected, for non-polysemous queries the system generates fewer categories; (c) the hierarchy precision of the concept hierarchies generated for polysemous queries is found to be significantly better than that obtained using a baseline system.