Learning to merge search results for efficient Distributed Information Retrieval
Merging search results from different servers is a major problem in Distributed Information Retrieval. We used Regression-SVM and Ranking-SVM to learn a function that merges results based on readily available information, i.e. the ranks, titles, summaries and URLs contained in the results pages. By not downloading additional information, such as the full document, we decrease bandwidth usage. CORI and Round Robin merging were used as our baselines; surprisingly, our results show that the SVM methods do not improve over those baselines.
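As an illustration of the Round Robin baseline named above, the following is a minimal sketch that interleaves ranked lists by taking one result from each server in turn. The function name `round_robin_merge` and the duplicate-skipping behaviour are assumptions made for this example; the abstract does not specify how duplicates or uneven list lengths were handled.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked result lists, one item per server per round.

    Assumption for this sketch: a document returned by several servers
    is kept only at its first (highest) merged position.
    """
    merged = []
    seen = set()
    # zip_longest pads shorter lists with None so longer lists are not cut off.
    for round_of_results in zip_longest(*result_lists):
        for doc in round_of_results:
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


if __name__ == "__main__":
    servers = [["a", "b", "c"], ["d", "a"], ["e"]]
    print(round_robin_merge(servers))
```

Round Robin needs only the ranks within each list, which is why it suits the low-bandwidth setting described in the abstract.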
Search of spoken documents retrieves well recognized transcripts
This paper presents a series of analyses and experiments on spoken
document retrieval systems: search engines that retrieve transcripts produced by
speech recognizers. Results show that transcripts that match queries well tend to
be recognized more accurately than transcripts that match a query less well.
This result was described in past literature; however, no study or explanation of
the effect has been provided until now. This paper provides such an analysis
showing a relationship between word error rate and query length. The paper
expands on past research by increasing the number of recognition systems that
are tested as well as showing the effect in an operational speech retrieval
system. Potential future lines of enquiry are also described.
Examining repetition in user search behavior
This paper describes analyses of the repeated use of search engines.
It is shown that users commonly re-issue queries, either to examine search
results deeply or simply to query again, often days or weeks later. Hourly and
weekly periodicities in behavior are observed for both queries and clicks.
Navigational queries were found to be repeated differently from others.
LESIM: A Novel Lexical Similarity Measure Technique for Multimedia Information Retrieval
Metadata-based similarity measurement is far from obsolete today, despite research's focus on content and context. It supports aggregating information from textual references, measuring similarity when content is not available, traditional keyword search in search engines, merging results in meta-search engines, and many other activities of interest to research and industry. Existing similarity measures take into consideration neither the unique nature of multimedia metadata nor the requirements of metadata-based multimedia information retrieval. This work proposes a hybrid similarity measure customised for the commonly available author-title multimedia metadata, which is shown through experimentation to be significantly more effective than baseline measures.
A Comparative Analysis of Retrievability and PageRank Measures
The accessibility of documents within a collection holds a pivotal role in
Information Retrieval, signifying the ease of locating specific content in a
collection of documents. This accessibility can be achieved via two distinct
avenues. The first is through some retrieval model using a keyword or other
feature-based search, and the second is by navigating to a document through the
links associated with it, if available. Metrics such as PageRank, Hub, and
Authority illuminate the pathways through which documents can be discovered
within the network of content while the concept of Retrievability is used to
quantify the ease with which a document can be found by a retrieval model. In
this paper, we compare these two perspectives, PageRank and retrievability, as
they quantify the importance and discoverability of content in a corpus.
Through empirical experimentation on benchmark datasets, we demonstrate a
subtle similarity between retrievability and PageRank particularly
distinguishable for larger datasets.
Comment: Accepted at FIRE 202
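The two perspectives compared in this abstract can be sketched side by side. The code below is a simplified illustration, not the paper's experimental setup: `retrievability` uses the common cumulative form (a document scores one point for each query that ranks it within a cutoff), and `pagerank` is a plain power iteration. The function names, the cutoff, the damping factor of 0.85, and the toy data are all assumptions made for this example.

```python
from collections import Counter


def retrievability(rankings, cutoff=10):
    """Cumulative retrievability: count the queries that rank a document
    within the top `cutoff` positions. `rankings` maps query -> ranked doc ids."""
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc in ranked_docs[:cutoff]:
            scores[doc] += 1
    return dict(scores)


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank. `links` maps a node to its outgoing targets."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_rankings = {"q1": ["d1", "d2"], "q2": ["d1", "d3"], "q3": ["d2", "d1"]}
    toy_links = {"d1": ["d2"], "d2": ["d1"], "d3": ["d1"]}
    print(retrievability(toy_rankings, cutoff=1))
    print(pagerank(toy_links))
```

With both scores computed per document, a rank correlation between the two score lists is one way to quantify the kind of similarity the abstract reports.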
Two-Step Active Learning for Instance Segmentation with Uncertainty and Diversity Sampling
Training high-quality instance segmentation models requires an abundance of
labeled images with instance masks and classifications, which is often
expensive to procure. Active learning addresses this challenge by striving for
optimum performance with minimal labeling cost by selecting the most
informative and representative images for labeling. Despite its potential,
active learning has been less explored in instance segmentation compared to
other tasks like image classification, which require less labeling. In this
study, we propose a post-hoc active learning algorithm that integrates
uncertainty-based sampling with diversity-based sampling. Our proposed
algorithm is not only simple and easy to implement, but it also delivers
superior performance on various datasets. Its practical application is
demonstrated on a real-world overhead imagery dataset, where it increases the
labeling efficiency fivefold.
Comment: UNCV ICCV 202
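One way to combine uncertainty-based and diversity-based sampling in two steps is sketched below. This is not the authors' algorithm, whose details the abstract does not give: the sketch assumes that step one keeps the most uncertain images and step two picks a spread-out subset of them by greedy farthest-point selection over feature embeddings. The function name `two_step_select` and the parameters `pool_size` and `budget` are hypothetical.

```python
import math


def two_step_select(uncertainty, embeddings, pool_size, budget):
    """Return indices of images to label.

    Step 1 (uncertainty): keep the `pool_size` most uncertain images.
    Step 2 (diversity): greedily pick up to `budget` of them, each time
    choosing the candidate farthest from everything already selected.
    """
    if budget <= 0 or not uncertainty:
        return []
    order = sorted(range(len(uncertainty)),
                   key=lambda i: uncertainty[i], reverse=True)
    candidates = order[:pool_size]
    if not candidates:
        return []
    # Seed with the single most uncertain image.
    selected = [candidates[0]]
    while len(selected) < min(budget, len(candidates)):
        remaining = [i for i in candidates if i not in selected]
        best = max(
            remaining,
            key=lambda i: min(math.dist(embeddings[i], embeddings[j])
                              for j in selected),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.85, 0.1]
    feats = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 9.0)]
    print(two_step_select(scores, feats, pool_size=3, budget=2))
```

In the toy data, image 1 is uncertain but nearly identical to image 0, so the diversity step prefers image 2; image 3 is far from the others but is filtered out in step one because its uncertainty is low.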
Learning to Choose: Automatic Selection of the Information Retrieval Parameters
In this paper we promote a selective information retrieval process to be applied in the context of repeated queries. The method is based on a training phase in which the meta search system learns the best parameters to use on a per-query basis. The training phase uses a sample of annotated documents for which document relevance is known. When an identical query is submitted to the system, it automatically knows which parameters it should use to process the query. This Learning to Choose method is evaluated using simulated data from TREC campaigns. We show that system performance increases substantially in terms of precision (MAP), specifically for the queries that are difficult to answer, when compared to any single system configuration applied to all the queries.
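The training phase described above can be sketched as a per-query lookup table: for each training query, score every candidate configuration by average precision on the annotated documents and remember the winner. This is a minimal illustration under stated assumptions, not the paper's system. The function names, the data layout (`runs` maps configuration name to per-query rankings, `qrels` maps query to its relevant documents), and the configuration names in the demo are hypothetical.

```python
def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant docs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def learn_best_configs(runs, qrels):
    """For each annotated query, pick the configuration with the highest AP."""
    best = {}
    for query, relevant in qrels.items():
        best[query] = max(
            runs,
            key=lambda cfg: average_precision(runs[cfg].get(query, []), relevant),
        )
    return best


if __name__ == "__main__":
    # Hypothetical configuration names and toy rankings.
    runs = {
        "config_a": {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]},
        "config_b": {"q1": ["d3", "d1", "d2"], "q2": ["d4", "d5"]},
    }
    qrels = {"q1": {"d1"}, "q2": {"d4"}}
    print(learn_best_configs(runs, qrels))
```

When a previously seen query is submitted again, the system can look it up in the learned table and apply the stored configuration; averaging the per-query AP values over all queries gives the MAP figure used for evaluation.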