57,381 research outputs found

    Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

    Get PDF
    Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and significantly the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact on document retrieval behaviour. In particular relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data

    Temporal Feedback for Tweet Search with Non-Parametric Density Estimation

    Get PDF
    This paper investigates the temporal cluster hypothesis: in search tasks where time plays an important role, do relevant documents tend to cluster together in time? We explore this question in the context of tweet search and temporal feedback: starting with an initial set of results from a baseline retrieval model, we estimate the temporal density of relevant documents, which is then used for result reranking. Our contributions lie in a method to characterize this temporal density function using kernel density estimation, with and without human relevance judgments, and an approach to integrating this information into a standard retrieval model. Experiments on TREC datasets confirm that our temporal feedback formulation improves search effectiveness, thus providing support for our hypothesis. Our approach outperforms both a standard baseline and previous temporal retrieval models. Temporal feedback improves over standard lexical feedback (with and without human judgments), illustrating that temporal relevance signals exist independently of document content

    DCU at the NTCIR-12 SpokenQuery&Doc-2 task

    Get PDF
    We describe DCU’s participation in the NTCIR-12 SpokenQuery&Doc (SQD-2) task. In the context of the slide-group retrieval sub-task, we experiment with a passage retrieval method that re-scores each passage according to the relevance score of the document from which the passage is taken. This is performed by linearly interpolating their relevance scores which are calculated using the Okapi BM25 model of probabilistic retrieval for passages and documents independently. In conjunction with this, we assess the benefits of using pseudo-relevance feedback for expanding the textual representation of the spoken queries with terms found in the top-ranked documents and passages, and experiment with a general multidimensional optimisation method to jointly tune the BM25 and query expansion parameters with queries and relevance data from the NTCIR-11 SQD-1 task. Retrieval experiments performed over the SQD-1 and SQD-2 queries confirm previous findings which affirm that integrating document information when ranking passages can lead to improved passage retrieval effectiveness. Furthermore, results indicate that no significant gains in retrieval effectiveness can be obtained by using query expansion in combination with our retrieval models over these two query sets
    • …
    corecore