Retrieving with good sense
Although always present in text, word sense ambiguity only recently came to be regarded as a potentially solvable
problem for information retrieval. The growth of interest in word senses resulted from new directions taken in
disambiguation research. This paper first outlines this research and surveys the resulting efforts in information
retrieval. Although the majority of attempts to improve retrieval effectiveness were unsuccessful, much was learnt
from the research, most notably a notion of the circumstances under which disambiguation may prove of use to retrieval.
Accurate user directed summarization from existing tools
This paper describes a set of experimental results produced from the TIPSTER SUMMAC initiative on user directed
summaries: document summaries generated in the context of an information need expressed as a query. The summarizer
that was evaluated was based on a set of existing statistical techniques that had been applied successfully to the
INQUERY retrieval system. The techniques proved to have a wider utility, however, as the summarizer was one of the
better performing systems in the SUMMAC evaluation. The design of this summarizer is presented with a range of
evaluations: both those provided by SUMMAC and a set of preliminary, more informal, evaluations that examined
additional aspects of the summaries. Amongst other conclusions, the results reveal that users can judge the
relevance of documents from their summary almost as accurately as if they had had access to the document’s full text.
Word sense disambiguation and information retrieval
It has often been thought that word sense ambiguity is a cause of poor performance in Information Retrieval
(IR) systems. The belief is that if ambiguous words can be correctly disambiguated, IR performance will
increase. However, recent research into the application of a word sense disambiguator to an IR system failed
to show any performance increase. From these results it has become clear that more basic research is needed
to investigate the relationship between sense ambiguity, disambiguation, and IR.
Using a technique that introduces additional sense ambiguity into a collection, this paper presents research
that goes beyond previous work in this field to reveal the influence that ambiguity and disambiguation have
on a probabilistic IR system. We conclude that word sense ambiguity is only problematic to an IR system
when it is retrieving from very short queries. In addition we argue that if a word sense disambiguator is to
be of any use to an IR system, the disambiguator must be able to resolve word senses to a high degree of
accuracy.
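As an illustration only: one way to introduce additional sense ambiguity into a collection is to conflate unrelated words into a single artificial token, so that each occurrence of the token could stand for any of its source words. The abstract does not say how the ambiguity was introduced, so this choice of technique, the function name `add_ambiguity`, the `bank/spring` token format, and the sample sentence are all assumptions made for this sketch.

```python
import re

def add_ambiguity(text, word_groups):
    """Replace every word in a group with one shared, ambiguous token.

    word_groups is a list of tuples of lowercase words; the shared token
    is the group's words joined by "/" (a hypothetical naming choice).
    """
    mapping = {word: "/".join(group) for group in word_groups for word in group}
    # Split into runs of word characters and runs of everything else,
    # so that punctuation and spacing are preserved on rejoin.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(mapping.get(token.lower(), token) for token in tokens)

sample = "The bank raised rates. The spring was cold."
ambiguous = add_ambiguity(sample, [("bank", "spring")])
print(ambiguous)
```

Because the original word behind each conflated token is known, such a collection lets one measure how retrieval changes as ambiguity is added, and how accurate a disambiguator would need to be to undo it.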
Revisiting h measured on UK LIS and IR academics
A brief communication appearing in this journal ranked UK LIS and (some) IR academics by their h-index
using data derived from Web of Science. In this brief communication, the same academics were re-ranked,
using other popular citation databases. It was found that for academics who publish more in computer
science forums, their h was significantly different due to highly cited papers missed by Web of Science;
consequently their rank changed substantially. The study was widened to a broader set of UK LIS and IR
academics where results showed similar statistically significant differences. A variant of h, hmx, was
introduced that allowed a ranking of the academics using all citation databases together.
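For readers unfamiliar with the measure, the standard h-index is the largest h such that an author has h papers with at least h citations each. The sketch below computes it for one author from two sets of citation counts; the counts and the names `database_a` and `database_b` are invented for illustration, and the hmx variant is not reproduced because the abstract does not define it.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical per-paper citation counts for the same academic,
# as two different citation databases might report them.
database_a = [25, 8, 5, 3, 3, 1]
database_b = [40, 30, 8, 6, 5, 4, 2]

print(h_index(database_a), h_index(database_b))
```

A database that misses an author's highly cited papers, or undercounts their citations, yields a lower h, which is how a change of data source can reorder a ranking.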
Duplicate Detection in the Reuters Collection
While conducting some experiments with the Reuters collection, it was discovered
that contained within it were a number of documents that were exact duplicates of
each other (see Figure 1). A short study was conducted to try to discover how many
such documents there were. The results of this study revealed that the notion of a
duplicate document was not as simple as first thought.
The contents of this report are as follows. A brief review of previous duplicate detection
research will be presented, followed by a description of the methods and results of
the duplicate detection work conducted here. In addition, there is an appendix holding
the document ids of the various types of duplicate found.
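A minimal sketch of exact duplicate detection, assuming documents are held as a mapping from document id to text: hash each text and group the ids that share a digest. This is not the method used in the report; the function name `find_duplicates`, the `normalise` option, and the sample documents are hypothetical. The option hints at why the notion of a duplicate is less simple than it looks, since texts that differ only in case or spacing are not byte-identical.

```python
import hashlib
from collections import defaultdict

def find_duplicates(documents, normalise=False):
    """Group ids of documents with identical text.

    With normalise=True, case and whitespace differences are ignored
    (an illustrative, looser definition of "duplicate").
    """
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        key_text = " ".join(text.lower().split()) if normalise else text
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Keep only groups with more than one member, in a stable order.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)

# Invented sample documents, not taken from the Reuters collection.
docs = {
    "d1": "Oil prices rose on Monday.",
    "d2": "Oil prices rose on Monday.",
    "d3": "Oil  prices rose on MONDAY.",
    "d4": "Wheat exports fell.",
}

exact = find_duplicates(docs)
loose = find_duplicates(docs, normalise=True)
print(exact, loose)
```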