Automatic summarising: factors and directions
This position paper suggests that progress with automatic summarising demands
a better research methodology and a carefully focussed research strategy. In
order to develop effective procedures it is necessary to identify and respond
to the context factors, i.e. input, purpose, and output factors, that bear on
summarising and its evaluation. The paper analyses and illustrates these
factors and their implications for evaluation. It then argues that this
analysis, together with the state of the art and the intrinsic difficulty of
summarising, implies a nearer-term strategy concentrating on shallow, but not
surface, text analysis and on indicative summarising. This is illustrated with
current work, from which a potentially productive research programme can be
developed.
Ontology Based Query Expansion with a Probabilistic Retrieval Model
This paper examines the use of ontologies for defining query context. The information retrieval system used is based on the probabilistic retrieval model. We extend the relevance feedback (RFB) and pseudo-relevance feedback (PF) query expansion techniques using information from a news-domain ontology. The aim is to assess the impact of the ontology on the query expansion results with respect to recall and precision. We also tested the effect of varying the relevance feedback parameters (number of terms and number of documents). The factors that influence the success of ontology-based query expansion are outlined. Our findings show that ontology-based query expansion has had mixed success: the ontology vastly increased the number of relevant documents retrieved; however, we conclude that for both types of query expansion, the PF results are better than the RFB results.
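As a point of reference for the PF technique, here is a minimal sketch of blind-feedback query expansion in Python. It is an illustration, not the paper's system: the raw-frequency term selection, the function name, and the parameter defaults are all assumptions, and a probabilistic system would weight candidate terms using its retrieval model rather than counts.

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, num_docs=10, num_terms=5):
    """Blind/pseudo-relevance feedback: expand the query with the most
    frequent non-query terms from the top-ranked documents.

    ranked_docs: list of token lists, best-first.
    """
    counts = Counter()
    for doc in ranked_docs[:num_docs]:
        counts.update(t for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(num_terms)]

# Two pseudo-relevant documents vote for the expansion terms.
docs = [["bank", "interest", "rate", "loan"],
        ["bank", "rate", "mortgage", "loan"]]
print(prf_expand(["bank"], docs, num_docs=2, num_terms=2))
# -> ['bank', 'rate', 'loan']
```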
Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.
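The underlying estimator can be sketched compactly. Assuming a linear metric of the form M = Σ_d w_d · rel(d) (DCG is one instance, with w_d the rank discount), sampling documents to judge with a known probability and dividing each judged contribution by that probability gives an unbiased Monte Carlo estimate, even though most documents are never judged. The toy below uses an arbitrary sampling distribution; deriving the variance-optimizing one is part of the paper's contribution and is not attempted here.

```python
import random

def ht_estimate(weights, rel, sample_probs, n_samples=1000, seed=0):
    """Importance-sampling estimate of M = sum_d weights[d] * rel[d].

    Only the sampled documents need relevance judgments; dividing each
    contribution by its sampling probability keeps the estimate unbiased
    for any strictly positive sampling distribution.
    """
    rng = random.Random(seed)
    docs = list(weights)
    probs = [sample_probs[d] for d in docs]
    total = 0.0
    for _ in range(n_samples):
        d = rng.choices(docs, weights=probs, k=1)[0]
        total += weights[d] * rel[d] / sample_probs[d]
    return total / n_samples

# Toy ranking: rank discounts for three documents, with known relevance
# so the estimate can be checked against the exact value.
w = {"d1": 1.0, "d2": 0.63, "d3": 0.5}       # e.g. DCG discounts
r = {"d1": 1, "d2": 0, "d3": 1}              # true (hidden) judgments
q = {"d1": 0.5, "d2": 0.25, "d3": 0.25}      # arbitrary sampling dist.
print(ht_estimate(w, r, q))                  # close to the exact 1.5
```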
Combining global and local semantic contexts for improving biomedical information retrieval
Presented at the European Conference on Information Retrieval 2011. In the context of biomedical information retrieval (IR), this paper explores the relationship between a document's global context and a query's local context in an attempt to overcome the term-mismatch problem between the user query and the documents in the collection. Most solutions to this problem expand the query by discovering its context, either global or local. In a global strategy, all documents in the collection are used to examine word occurrences and relationships in the corpus as a whole, and this information is used to expand the original query. In a local strategy, the top-ranked documents retrieved for a given query are examined to determine terms for query expansion. We propose to combine the document's global context and the query's local context to increase the term overlap between the user query and the documents in the collection via document expansion (DE) and query expansion (QE). The DE technique is based on a statistical (IR-based) method to extract the most appropriate concepts (global context) from each document. The QE technique is based on a blind-feedback approach using the top-ranked documents (local context) obtained in the first retrieval stage. A comparative experiment on the TREC 2004 Genomics collection demonstrates that combining the document's global context and the query's local context yields a significant improvement over the baseline: MAP rises from 0.4097 to 0.4532, a statistically significant improvement of +10.62%. The IR performance of the combined method in terms of MAP is also superior to the official runs submitted to TREC 2004 Genomics and is comparable to the performance of the best run (0.4075).
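A schematic of the combined pipeline, under loose assumptions: the hard-coded concept map stands in for the paper's IR-based concept extraction, and the overlap score stands in for a real ranking model. Only the two-stage structure (DE at indexing time, blind-feedback QE between two retrieval passes) reflects the abstract.

```python
from collections import Counter

# Hypothetical concept map standing in for the paper's IR-based concept
# extraction over a biomedical terminology (an assumption, for display).
CONCEPTS = {"p53": ["tumor suppressor"], "brca1": ["breast cancer gene"]}

def expand_doc(doc):
    """Document expansion (global context): append extracted concepts."""
    return doc + [c for t in doc for c in CONCEPTS.get(t, [])]

def score(query, doc):
    """Toy term-overlap score; a real system would use a probabilistic
    retrieval model here."""
    return sum(doc.count(t) for t in query)

def search(query, docs):
    return sorted(docs, key=lambda d: score(query, d), reverse=True)

def de_qe_retrieve(query, docs, fb_docs=1, fb_terms=1):
    """DE at indexing time, blind-feedback QE between two passes."""
    index = [expand_doc(d) for d in docs]
    first = search(query, index)                       # first retrieval
    counts = Counter(t for d in first[:fb_docs] for t in d
                     if t not in query)
    query = list(query) + [t for t, _ in counts.most_common(fb_terms)]
    return search(query, index)                        # second retrieval

docs = [["p53", "pathway"], ["brca1", "mutation"]]
print(de_qe_retrieve(["p53"], docs))
```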
Non-Compositional Term Dependence for Information Retrieval
Modelling term dependence in IR aims to identify co-occurring terms that are
too heavily dependent on each other to be treated as a bag of words, and to
adapt the indexing and ranking accordingly. Dependent terms are predominantly
identified using lexical frequency statistics, assuming that (a) if terms
co-occur often enough in some corpus, they are semantically dependent; (b) the
more often they co-occur, the more semantically dependent they are. This
assumption is not always correct: the frequency of co-occurring terms can be
decoupled from the strength of their semantic dependence. For example, "red
tape" might be less frequent overall than "tape measure" in some corpus, but
this does not mean that "red"+"tape" are less dependent than
"tape"+"measure". This is
especially the case for non-compositional phrases, i.e. phrases whose meaning
cannot be composed from the individual meanings of their terms (such as the
phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction
between the frequency and strength of term dependence in IR, we present a
principled approach for handling term dependence in queries, using both lexical
frequency and semantic evidence. We focus on non-compositional phrases,
extending a recent unsupervised model for their detection [21] to IR. Our
approach, integrated into ranking using Markov Random Fields [31], yields
effectiveness gains over competitive TREC baselines, showing that there is
still room for improvement in the very well-studied area of term dependence in
IR.
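For context, the sequential-dependence instantiation of the MRF ranking framework [31] that the approach builds on can be sketched as follows. This is a generic illustration, not the paper's model: the smoothing, window size, and weights are conventional defaults, and the paper's contribution, replacing frequency-only dependence evidence with semantic evidence for non-compositional phrases, is not implemented here.

```python
import math

def log_dirichlet(count, doc_len, coll_prob, mu=2500):
    """Dirichlet-smoothed log-probability used by each MRF potential."""
    return math.log((count + mu * coll_prob) / (doc_len + mu))

def sdm_score(query, doc, coll, lam=(0.8, 0.1, 0.1), window=8):
    """Sequential-dependence MRF score: weighted sum of unigram,
    exact-ordered-bigram, and unordered-within-window potentials."""
    n = len(doc)
    s = 0.0
    for t in query:                               # unigram potentials
        s += lam[0] * log_dirichlet(doc.count(t), n, coll.get(t, 1e-6))
    for a, b in zip(query, query[1:]):            # bigram potentials
        ordered = sum(1 for i in range(n - 1)
                      if doc[i] == a and doc[i + 1] == b)
        pos_a = [i for i, t in enumerate(doc) if t == a]
        pos_b = [i for i, t in enumerate(doc) if t == b]
        unordered = sum(1 for i in pos_a for j in pos_b
                        if abs(i - j) < window)
        s += lam[1] * log_dirichlet(ordered, n, coll.get((a, b), 1e-8))
        s += lam[2] * log_dirichlet(unordered, n, coll.get((a, b), 1e-8))
    return s

doc = ["cut", "the", "red", "tape", "now"]
coll = {"red": 0.001, "tape": 0.001, ("red", "tape"): 1e-5}
print(sdm_score(["red", "tape"], doc, coll))
```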
A Latent Dirichlet Framework for Relevance Modeling
Relevance-based language models operate by estimating the probabilities of observing words in documents relevant (or pseudo-relevant) to a topic. However, these models assume that if a document is relevant to a topic, then all tokens in the document are relevant to that topic. This can limit model robustness and effectiveness. In this study, we propose a Latent Dirichlet relevance model, which relaxes this assumption. Our approach derives from current research on Latent Dirichlet Allocation (LDA) topic models. LDA has been extensively explored, especially for generating a set of topics from a corpus. A key attraction is that in LDA a document may be about several topics. LDA itself, however, has a limitation that is also addressed in our work: topics generated by LDA from a corpus are synthetic, i.e., they do not necessarily correspond to topics identified by humans for the same corpus. In contrast, our model explicitly considers the relevance relationships between documents and given topics (queries). Thus, unlike standard LDA, our model is directly applicable to goals such as relevance feedback for query modification and text classification, where topics (classes and queries) are provided upfront. Although the focus of our paper is on improving relevance-based language models, in effect our approach bridges relevance-based language models and LDA, addressing limitations of both. Finally, we propose an idea that takes advantage of the "bag-of-words" assumption to reduce the complexity of the Gibbs-sampling-based learning algorithm.
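For contrast, here is a minimal sketch of the classic relevance model (RM1) that this work generalizes. The line to notice is the per-document language model doc.count(w) / len(doc): every token in a pseudo-relevant document contributes to the topic, which is exactly the assumption the Latent Dirichlet relevance model relaxes by assigning tokens to latent topics. Function and variable names are illustrative, not from the paper.

```python
def rm1(fb_docs, doc_scores, vocab):
    """Classic relevance model: P(w|R) as a retrieval-score-weighted
    mixture of the feedback documents' unigram language models."""
    total = sum(doc_scores)
    return {w: sum(score / total * doc.count(w) / len(doc)
                   for doc, score in zip(fb_docs, doc_scores))
            for w in vocab}

# Two pseudo-relevant documents with retrieval scores 0.7 and 0.3.
docs = [["jaguar", "car", "speed"], ["jaguar", "cat", "jungle"]]
model = rm1(docs, doc_scores=[0.7, 0.3], vocab={"jaguar", "car", "cat"})
print(sorted(model.items(), key=lambda kv: -kv[1]))
# 'jaguar' dominates; 'car' outweighs 'cat' via the document scores.
```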
Indexing, browsing and searching of digital video
Video is a communications medium that normally brings together moving pictures and a synchronised audio track into a discrete piece or pieces of information. The size of a "piece" of video can variously be referred to as a frame, a shot, a scene, a clip, a programme or an episode; these are distinguished by their lengths and by their composition. We shall return to the definition of each of these in section 4 of this chapter. In modern society, video is ver…
