3 research outputs found
On the Additivity and Weak Baselines for Search Result Diversification Research
A recent study on the topic of additivity addresses the task of search result diversification and concludes that while weaker baselines are almost always significantly improved by the evaluated diversification methods, for stronger baselines, just the opposite happens, i.e., no significant improvement can be observed. Due to the importance of the issue in shaping future research directions and evaluation strategies in search results diversification, in this work, we first aim to reproduce the findings reported in the previous study, and then investigate its possible limitations. Our extensive experiments first reveal that under the same experimental setting with that previous study, we can reach similar results. Next, we hypothesize that for stronger baselines, tuning the parameters of some methods (i.e., the trade-off parameter between the relevance and diversity of the results in this particular scenario) should be done in a more fine-grained manner. With trade-off parameters that are specifically determined for each baseline run, we show that the percentage of significant improvements even over the strong baselines can be doubled. As a further issue, we discuss the possible impact of using the same strong baseline retrieval function for the diversity computations of the methods. Our takeaway message is that in the case of a strong baseline, it is more crucial to tune the parameters of the diversification methods to be evaluated; but once this is done, additivity is achievable
Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models
Is neural IR mostly hype? In a recent SIGIR Forum article, Lin expressed
skepticism that neural ranking models were actually improving ad hoc retrieval
effectiveness in limited data scenarios. He provided anecdotal evidence that
authors of neural IR papers demonstrate "wins" by comparing against weak
baselines. This paper provides a rigorous evaluation of those claims in two
ways: First, we conducted a meta-analysis of papers that have reported
experimental results on the TREC Robust04 test collection. We do not find
evidence of an upward trend in effectiveness over time. In fact, the best
reported results are from a decade ago and no recent neural approach comes
close. Second, we applied five recent neural models to rerank the strong
baselines that Lin used to make his arguments. A significant improvement was
observed for one of the models, demonstrating additivity in gains. While there
appears to be merit to neural IR approaches, at least some of the gains
reported in the literature appear illusory.Comment: Published in the Proceedings of the 42nd Annual International ACM
SIGIR Conference on Research and Development in Information Retrieval (SIGIR
2019
Supervised approaches for explicit search result diversification
Diversification of web search results aims to promote documents with diverse content (i.e., covering different aspects of a query) to the top-ranked positions, to satisfy more users, enhance fairness and reduce bias. In this work, we focus on the explicit diversification methods, which assume that the query aspects are known at the diversification time, and leverage supervised learning methods to improve their performance in three different frameworks with different features and goals. First, in the LTRDiv framework, we focus on applying typical learning to rank (LTR) algorithms to obtain a ranking where each top-ranked document covers as many aspects as possible. We argue that such rankings optimize various diversification metrics (under certain assumptions), and hence, are likely to achieve diversity in practice. Second, in the AspectRanker framework, we apply LTR for ranking the aspects of a query with the goal of more accurately setting the aspect importance values for diversification. As features, we exploit several pre- and post-retrieval query performance predictors (QPPs) to estimate how well a given aspect is covered among the candidate documents. Finally, in the LmDiv framework, we cast the diversification problem into an alternative fusion task, namely, the supervised merging of rankings per query aspect. We again use QPPs computed over the candidate set for each aspect, and optimize an objective function that is tailored for the diversification goal. We conduct thorough comparative experiments using both the basic systems (based on the well-known BM25 matching function) and the best-performing systems (with more sophisticated retrieval methods) from previous TREC campaigns. Our findings reveal that the proposed frameworks, especially AspectRanker and LmDiv, outperform both non-diversified rankings and two strong diversification baselines (i.e., xQuAD and its variant) in terms of various effectiveness metrics