Retrieving with good sense
Although always present in text, word sense ambiguity only recently came to be regarded as a potentially solvable
problem for information retrieval. The growth of interest in word senses resulted from new directions taken in
disambiguation research. This paper first outlines this research and surveys the resulting efforts in information
retrieval. Although the majority of attempts to improve retrieval effectiveness were unsuccessful, much was learnt
from the research, most notably a notion of the circumstances under which disambiguation may prove of use to retrieval.
Accurate user directed summarization from existing tools
This paper describes a set of experimental results produced from the TIPSTER SUMMAC initiative on user directed
summaries: document summaries generated in the context of an information need expressed as a query. The summarizer
that was evaluated was based on a set of existing statistical techniques that had been applied successfully to the
INQUERY retrieval system. The techniques proved to have a wider utility, however, as the summarizer was one of the
better performing systems in the SUMMAC evaluation. The design of this summarizer is presented with a range of
evaluations: both those provided by SUMMAC as well as a set of preliminary, more informal, evaluations that
examined additional aspects of the summaries. Amongst other conclusions, the results reveal that users can judge
the relevance of documents from their summary almost as accurately as if they had had access to the document’s
full text.
Word sense disambiguation and information retrieval
It has often been thought that word sense ambiguity is a cause of poor performance in Information Retrieval
(IR) systems. The belief is that if ambiguous words can be correctly disambiguated, IR performance will
increase. However, recent research into the application of a word sense disambiguator to an IR system failed
to show any performance increase. From these results it has become clear that more basic research is needed
to investigate the relationship between sense ambiguity, disambiguation, and IR.
Using a technique that introduces additional sense ambiguity into a collection, this paper presents research
that goes beyond previous work in this field to reveal the influence that ambiguity and disambiguation have
on a probabilistic IR system. We conclude that word sense ambiguity is only problematic to an IR system
when it is retrieving from very short queries. In addition we argue that if a word sense disambiguator is to
be of any use to an IR system, the disambiguator must be able to resolve word senses to a high degree of
accuracy.
Revisiting h measured on UK LIS and IR academics
A brief communication appearing in this journal ranked UK LIS and (some) IR academics by their h-index
using data derived from Web of Science. In this brief communication, the same academics were re-ranked,
using other popular citation databases. It was found that, for academics who publish more in computer
science forums, h was significantly different because Web of Science missed highly cited papers;
consequently their rank changed substantially. The study was widened to a broader set of UK LIS and IR
academics, where results showed similar statistically significant differences. A variant of h, hmx, was
introduced that allowed a ranking of the academics using all citation databases together.
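For readers unfamiliar with the measure, the h-index is the largest h such that an author has h papers with at least h citations each. The sketch below computes it per database for one academic. The abstract does not define hmx, so taking the maximum h across databases is only an assumed stand-in for a combined score, and the database names and citation counts are hypothetical.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-paper citation counts for one academic in three databases.
counts_by_database = {
    "database_a": [25, 8, 5, 3, 3, 1],
    "database_b": [40, 22, 9, 6, 5, 5, 2],
    "database_c": [12, 4, 1],
}

h_by_database = {name: h_index(c) for name, c in counts_by_database.items()}

# ASSUMPTION: a combined score taken as the maximum h over all databases.
# This is an illustrative stand-in, not the paper's definition of hmx.
combined_h = max(h_by_database.values())

print(h_by_database, combined_h)
```

The same academic can receive quite different h values depending on which papers and citations a database indexes, which is the effect the abstract reports.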
Duplicate Detection in the Reuters Collection
While conducting some experiments with the Reuters collection, it was discovered
that contained within it were a number of documents that were exact duplicates of
each other (see Figure 1). A short study was conducted to try to discover how many
such documents there were. The results of this study revealed that the notion of a
duplicate document was not as simple as first thought.
The contents of this report are as follows. A brief review of previous duplicate detection
research will be presented, followed by a description of the methods and results of
the duplicate detection work conducted here. In addition, there is an appendix holding
the document ids of the various types of duplicate found.
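As an illustration of why the notion of a duplicate is less simple than it first appears, the sketch below groups documents by a hash of their text, once on the raw text and once after a light normalization. It is not the method used in the report; the function names, the normalization rule, and the sample documents are all hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase the text and collapse runs of whitespace (an assumed rule)."""
    return " ".join(text.lower().split())


def find_duplicates(documents, normalizer=None):
    """Return sorted groups of document ids whose (optionally normalized) text is identical."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        key_text = normalizer(text) if normalizer else text
        digest = hashlib.sha1(key_text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)


# Hypothetical documents: d2 is an exact copy of d1, d3 differs only in case and spacing.
documents = {
    "d1": "Gold prices rose on Monday.",
    "d2": "Gold prices rose on Monday.",
    "d3": "Gold  prices rose on monday.",
    "d4": "Oil output fell.",
}

exact = find_duplicates(documents)
near = find_duplicates(documents, normalizer=normalize)
print(exact, near)
```

Exact matching finds only the d1/d2 pair, while the normalized comparison also pulls in d3, so the count of "duplicates" depends on the definition chosen.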
The Reuters collection
This short paper presents the little-known Reuters 22,173 test collection, which is significantly
larger than most traditional test collections. In addition, Reuters has none of the recall calculation
problems normally associated with some of the larger test collections now available. This paper
explains the method (derived from Lewis [Lewis 91]) used to perform retrieval experiments on the
Reuters collection. Then, to illustrate the use of Reuters, some simple retrieval experiments that
compare the performance of stemming algorithms are also presented.
The infinite disk : challenges from no limitations
Challenge:
Managing and searching across multi-terabyte and potentially multi-petabyte personal stores of multimedia
information
Search of spoken documents retrieves well recognized transcripts
This paper presents a series of analyses and experiments on spoken
document retrieval systems: search engines that retrieve transcripts produced by
speech recognizers. Results show that transcripts that match queries well tend to
be recognized more accurately than transcripts that match a query less well.
Although this result has been described in past literature, no study or explanation of
the effect has been provided until now. This paper provides such an analysis,
showing a relationship between word error rate and query length. The paper
expands on past research by increasing the number of recognition systems that
are tested as well as showing the effect in an operational speech retrieval
system. Potential future lines of enquiry are also described.
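Word error rate, the measure referred to above, is conventionally the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch follows, assuming whitespace tokenization; the function name and sample transcripts are illustrative and not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions) over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference transcript and recognizer output.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(round(wer, 3))
```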
Keep It Simple Sheffield – a KISS approach to the Arabic track
Sheffield’s participation in the inaugural Arabic cross language track is described here. Our goal was to
examine how well one could achieve retrieval of Arabic text with the minimum of resources and adaptation
of existing retrieval systems. To this end the public translators used for query translation and the minimal
changes to our retrieval system are described. While the effectiveness of our resulting system is not as high
as one might desire, it nevertheless provides reasonable performance particularly in the monolingual track:
on average, just under four relevant documents were found in the 10 top ranked documents.
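The figure quoted above corresponds to a mean precision at rank 10 of just under 0.4. A minimal sketch of that measure follows; the function names, topic ids, rankings, and relevance judgments are hypothetical and only show how such an average is formed.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top k ranked documents that are judged relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, judgments, k=10):
    """Average precision at k over all topics in a run."""
    scores = [precision_at_k(ranking, judgments[topic], k) for topic, ranking in runs.items()]
    return sum(scores) / len(scores)


# Hypothetical rankings (document ids) and relevance judgments for two topics.
runs = {
    "topic1": ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"],
    "topic2": ["d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19", "d20"],
}
judgments = {
    "topic1": {"d1", "d4", "d7", "d9"},
    "topic2": {"d12", "d13", "d20"},
}

print(mean_precision_at_k(runs, judgments))
```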