21 research outputs found
Accurate user directed summarization from existing tools
This paper describes a set of experimental
results produced from the TIPSTER
SUMMAC initiative on user directed
summaries: document summaries generated in
the context of an information need expressed
as a query. The summarizer that was
evaluated was based on a set of existing
statistical techniques that had been applied
successfully to the INQUERY retrieval system.
The techniques proved to have a wider utility,
however, as the summarizer was one of the
better performing systems in the SUMMAC
evaluation. The design of this summarizer is
presented with a range of evaluations: both
those provided by SUMMAC as well as a set of
preliminary, more informal, evaluations that
examined additional aspects of the summaries.
Amongst other conclusions, the results reveal
that users can judge the relevance of
documents from their summary almost as
accurately as if they had had access to the
documentās full text
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections. For example, annotations of entities from large
general purpose knowledge bases, such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE) which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches
Indexing and retrieval of broadcast news
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval
system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary
corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on
an archive of British English broadcast news
Improving document representation by accumulating relevance feedback : the relevance feedback accumulation (RFA) algorithm
Document representation (indexing) techniques are dominated by variants of the term-frequency analysis approach, based on the assumption that the more occurrences a term has throughout a document the more important the term is in that document. Inherent drawbacks associated with this approach include: poor index quality, high document representation size and the word mismatch problem. To tackle these drawbacks, a document representation improvement method called the Relevance Feedback Accumulation (RFA) algorithm is presented. The algorithm provides a mechanism to continuously accumulate relevance assessments over time and across users. It also provides a document representation modification function, or document representation learning function that gradually improves the quality of the document representations. To improve document representations, the learning function uses a data mining measure called support for analyzing the accumulated relevance feedback.
Evaluation is done by comparing the RFA algorithm to other four algorithms. The four measures used for evaluation are (a) average number of index terms per document; (b) the quality of the document representations assessed by human judges; (c) retrieval effectiveness; and (d) the quality of the document representation learning function. The evaluation results show that (1) the algorithm is able to substantially reduce the document representations size while maintaining retrieval effectiveness parameters; (2) the algorithm provides a smooth and steady document representation learning function; and (3) the algorithm improves the quality of the document representations. The RFA algorithm\u27s approach is consistent with efficiency considerations that hold in real information retrieval systems.
The major contribution made by this research is the design and implementation of a novel, simple, efficient, and scalable technique for document representation improvement
Term Association Modelling in Information Retrieval
Many traditional Information Retrieval (IR) models assume that query terms are independent of each other. For those models, a document is normally represented as a bag of words/terms and their frequencies. Although traditional retrieval models can achieve reasonably good performance in many applications, the corresponding independence assumption has limitations. There are some recent studies that investigate how to model term associations/dependencies by proximity measures. However, the modeling of term associations theoretically under the probabilistic retrieval framework is still largely unexplored.
In this thesis, I propose a new concept named Cross Term, to model term proximity, with the aim of boosting retrieval performance. With Cross Terms, the association of multiple query terms can be modeled in the same way as a simple unigram term. In particular, an occurrence of a query term is assumed to have an impact on its neighboring text. The degree of the query term impact gradually weakens with increasing distance from the place of occurrence. Shape functions are used to characterize such impacts. Based on this assumption, I first propose a bigram CRoss TErm Retrieval (CRTER2) model for probabilistic IR and a Language model based model CRTER2LM. Specifically, a bigram Cross Term occurs when the corresponding query terms appear close to each other, and its impact can be modeled by the intersection of the respective shape functions of the query terms. Second, I propose a generalized n-gram CRoss TErm Retrieval (CRTERn) model recursively for n query terms where n>2. For n-gram Cross Term, I develop several distance metrics with different properties and employ them in the proposed models for ranking. Third, an enhanced context-sensitive proximity model is proposed to boost the CRTER models, where the contextual relevance of term proximity is studied. The models are validated on several large standard data sets, and show improved performance over other state-of-art approaches. I also discusse the practical impact of the proposed models. The approaches in this thesis can also provide helpful benefit for term association modeling in other domains
A Boolean based Question Answering System
The search engine searches the information according to the key words and provides users with related links, which need users to review and find the direct information among a large number of webpages. To avoid this drawback and improve the search results from search engine, we implemented a Boolean based Question Answering System. This system used Boolean Retrieval Model to analyze and match the text information from corresponding webpages in the document indexing step when users ask a Boolean expression based question. To evaluate system and analyze Boolean Retrieval Model, we used the data set from TREC (Text Retrieval Conference) to finish our experiment. Different Boolean operators in the questions such as AND, OR has been evaluated separately which is clear to analyze the effectiveness for each of them. We also evaluate the overall performance for this system