
    Understanding and Predicting Characteristics of Test Collections in Information Retrieval

    Research community evaluations in information retrieval, such as NIST's Text REtrieval Conference (TREC), build reusable test collections by pooling document rankings submitted by many teams. The quality of the resulting test collection therefore depends greatly on the number of participating teams and on the quality of their submitted runs. In this work, we investigate: i) how the number of participants, coupled with other factors, affects the quality of a test collection; and ii) whether the quality of a test collection can be inferred prior to collecting relevance judgments from human assessors. Experiments conducted on six TREC collections illustrate how the number of teams interacts with various other factors to influence the resulting quality of test collections. We also show that the reusability of a test collection can be predicted with high accuracy when the same document collection is used for successive years in an evaluation campaign, as is common in TREC. Comment: Accepted as a full paper at iConference 202
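
    For readers unfamiliar with pooling, the sketch below illustrates the basic mechanics assumed above: each team's runs contribute their top-ranked documents to a shared judgment pool, and a leave-one-team-out comparison gives a crude sense of how much any single team contributes. The pool depth, function names and toy data are illustrative assumptions, not details from the paper.

        # Minimal sketch of depth-k pooling over team runs (illustrative only;
        # the pool depth and data are made up, not taken from the paper).

        def pool(runs, pool_depth=100):
            """Union of the top-`pool_depth` documents of every submitted run."""
            pooled = set()
            for ranking in runs:
                pooled.update(ranking[:pool_depth])
            return pooled

        def leave_one_team_out(team_runs, pool_depth=100):
            """Fraction of the full pool that survives when one team is removed."""
            all_runs = [run for runs in team_runs.values() for run in runs]
            full = pool(all_runs, pool_depth)
            coverage = {}
            for team in team_runs:
                rest = [run for t, runs in team_runs.items() if t != team for run in runs]
                coverage[team] = len(pool(rest, pool_depth) & full) / len(full)
            return coverage

        # Toy example: two teams, one ranked run each.
        team_runs = {"teamA": [["d1", "d2", "d3"]], "teamB": [["d3", "d4", "d5"]]}
        print(leave_one_team_out(team_runs, pool_depth=2))  # {'teamA': 0.5, 'teamB': 0.5}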

    Compare statistical significance tests for information retrieval evaluation

    Preprint of our Journal of the Association for Information Science and Technology (JASIST) paper. [Abstract] Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In Information Retrieval, some studies have examined the validity and suitability of such tests for comparing search systems. We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falsity of the null hypothesis. This new method for studying the power of significance tests in Information Retrieval evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and the Wilcoxon signed test have more power than the permutation test and the t-test. The sign test and the Wilcoxon signed test also behave well in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested. Ministerio de Economía y Competitividad; TIN2015-64282-R. Xunta de Galicia; GPC 2016/035. Xunta de Galicia; ED431G/01. Xunta de Galicia; ED431G/0
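
    As a rough illustration of this style of analysis (not the authors' score-distribution models), the sketch below simulates paired per-topic score differences under a known null or alternative and counts how often each test rejects, yielding empirical type I error and power. The Gaussian difference model, the effect size of 0.3, and α = 0.05 are assumptions.

        # Hedged sketch: empirical power / type I error of paired tests on
        # simulated per-topic score differences (Gaussian model is assumed).
        import numpy as np
        from scipy import stats

        def rejection_rate(effect, n_topics=50, trials=2000, alpha=0.05, seed=0):
            rng = np.random.default_rng(seed)
            rejects = {"t-test": 0, "wilcoxon": 0, "sign": 0}
            for _ in range(trials):
                diffs = rng.normal(loc=effect, scale=1.0, size=n_topics)
                if stats.ttest_1samp(diffs, 0.0).pvalue < alpha:
                    rejects["t-test"] += 1
                if stats.wilcoxon(diffs).pvalue < alpha:
                    rejects["wilcoxon"] += 1
                positives = int((diffs > 0).sum())
                if stats.binomtest(positives, n=n_topics, p=0.5).pvalue < alpha:
                    rejects["sign"] += 1
            return {name: count / trials for name, count in rejects.items()}

        print("type I error:", rejection_rate(effect=0.0))   # null hypothesis true
        print("power:", rejection_rate(effect=0.3))          # null hypothesis false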

    When to stop making relevance judgments? A study of stopping methods for building information retrieval test collections

    This is the peer reviewed version of the following article: David E. Losada, Javier Parapar and Alvaro Barreiro (2019) When to Stop Making Relevance Judgments? A Study of Stopping Methods for Building Information Retrieval Test Collections. Journal of the Association for Information Science and Technology, 70 (1), 49-60, which has been published in final form at https://doi.org/10.1002/asi.24077. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions. In information retrieval evaluation, pooling is a well-known technique to extract a sample of documents to be assessed for relevance. Given the pooled documents, a number of studies have proposed different prioritization methods to adjudicate documents for judgment. These methods follow different strategies to reduce the assessment effort. However, there is no clear guidance on how many relevance judgments are required for creating a reliable test collection. In this article we investigate and further develop methods to determine when to stop making relevance judgments. We propose a highly diversified set of stopping methods and provide a comprehensive analysis of the usefulness of the resulting test collections. Some of the stopping methods introduced here combine innovative estimates of recall with time series models used in financial trading. Experimental results on several representative collections show that some stopping methods can reduce the assessment effort by up to 95% and still produce a robust test collection. We demonstrate that the reduced set of judgments can be reliably employed to compare search systems using disparate effectiveness metrics such as Average Precision, NDCG, P@100, and Rank Biased Precision. With all these measures, the correlations found between full pool rankings and reduced pool rankings are very high. This work received financial support from the (i) "Ministerio de Economía y Competitividad" of the Government of Spain and FEDER Funds under the research project TIN2015-64282-R, (ii) Xunta de Galicia (project GPC 2016/035), and (iii) Xunta de Galicia "Consellería de Cultura, Educación e Ordenación Universitaria" and the European Regional Development Fund (ERDF) through the following 2016–2019 accreditations: ED431G/01 ("Centro singular de investigación de Galicia") and ED431G/08S
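
    A toy version of a stopping rule in this spirit is sketched below: pooled documents are assessed in a prioritized order, and judging stops once an estimated recall crosses a target. The recall estimator here simply projects the recent discovery rate onto the unjudged remainder; it is a placeholder for illustration, not one of the paper's recall or time-series estimators.

        # Toy "stop judging" rule: assess pooled documents in priority order and
        # stop when the estimated recall of relevant documents exceeds a target.
        from collections import deque

        def judge_until_recall(ranked_pool, is_relevant, target_recall=0.95, window=50):
            """Judge documents in priority order; stop when estimated recall is high.

            The recall estimate projects the *recent* discovery rate (relevant
            docs per judgment over the last `window` assessments) onto the
            unjudged rest of the pool. Crude placeholder, illustration only.
            """
            recent = deque(maxlen=window)
            found = 0
            for judged, doc in enumerate(ranked_pool, start=1):
                rel = bool(is_relevant(doc))
                found += rel
                recent.append(rel)
                if judged >= window and found > 0:
                    recent_rate = sum(recent) / len(recent)
                    projected_total = found + recent_rate * (len(ranked_pool) - judged)
                    if found / projected_total >= target_recall:
                        return judged, found
            return len(ranked_pool), found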

    Statistical reform in information retrieval

    IR revolves around evaluation. Therefore, IR researchers should employ sound evaluation practices. Nowadays many of us know that statistical significance testing is not enough, but not all of us know exactly what to do about it. This paper provides suggestions on how to report effect sizes and confidence intervals along with p-values, in the context of comparing IR systems using test collections. Hopefully, these practices will make IR papers more informative, and help researchers form more reliable conclusions that "add up." Finally, I pose a specific question for the IR community: should IR journal editors and SIGIR PC chairs require (rather than encourage) the reporting of effect sizes and confidence intervals?
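
    The kind of reporting advocated here is easy to automate. The sketch below computes, for a paired comparison of two systems over the same topics, the p-value together with a standardized effect size and a confidence interval for the mean score difference. The use of Cohen's d on the paired differences and a t-based 95% interval are conventional choices assumed for illustration, not prescriptions from the paper.

        # Sketch: report an effect size and a confidence interval alongside the
        # p-value when comparing two IR systems over the same topics (paired).
        import numpy as np
        from scipy import stats

        def compare_systems(scores_a, scores_b, confidence=0.95):
            a = np.asarray(scores_a, dtype=float)
            b = np.asarray(scores_b, dtype=float)
            diffs = a - b
            n = diffs.size
            t_res = stats.ttest_rel(a, b)
            mean, sd = diffs.mean(), diffs.std(ddof=1)
            d = mean / sd                             # Cohen's d for paired data
            half = stats.t.ppf(0.5 + confidence / 2, df=n - 1) * sd / np.sqrt(n)
            return {
                "mean_diff": mean,
                "p_value": t_res.pvalue,
                "effect_size_d": d,
                "ci": (mean - half, mean + half),
            }

        # Example: per-topic AP scores for two hypothetical systems.
        print(compare_systems([0.31, 0.42, 0.28, 0.55, 0.40],
                              [0.25, 0.38, 0.30, 0.45, 0.33]))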

    How Many Crowd Workers Do I Need? On Statistical Power When Crowdsourcing Relevance Judgments

    To scale up the size of Information Retrieval collections, crowdsourcing has become a common way to collect relevance judgments. Crowdsourcing experiments usually employ 100-10,000 workers, but such a number is often decided in a heuristic way. The downside is that the resulting dataset has no guarantee of meeting predefined statistical requirements, such as having enough statistical power to distinguish in a statistically significant way between the relevance of two documents. We propose a methodology adapted from the literature on sound topic set size design, based on the t-test and ANOVA, which aims to guarantee that the resulting dataset meets a predefined set of statistical requirements. We validate our approach on several public datasets. Our results show that we can reliably estimate the recommended number of workers needed to achieve statistical power, and that such estimation depends on the topic, while the effect of the relevance scale is limited. Furthermore, we found that the estimation also depends on worker features such as agreement. Finally, we describe a set of practical estimation strategies that can be used to estimate the worker set size, and we also provide results on the estimation of document set sizes.
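
    To convey the flavor of such a calculation (not the paper's exact topic-set-size formulas), the sketch below uses off-the-shelf t-test power analysis to ask how many workers per document are needed to reach a target power for an assumed effect size; the effect size, α and power values are illustrative assumptions.

        # Sketch: how many workers per document for a target statistical power?
        # Effect size, alpha and power below are illustrative assumptions.
        import math
        from statsmodels.stats.power import TTestIndPower

        def workers_needed(effect_size=0.5, alpha=0.05, power=0.8):
            """Workers per document so that an independent two-sample t-test on
            mean relevance labels reaches the requested power for the effect."""
            n = TTestIndPower().solve_power(effect_size=effect_size,
                                            alpha=alpha, power=power)
            return math.ceil(n)

        print(workers_needed(effect_size=0.5))   # medium effect -> 64 workers/doc
        print(workers_needed(effect_size=0.8))   # large effect  -> 26 workers/doc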

    Improving single document summarization in a multi-document environment

    Most automatic document summarization tools produce summaries from single or multiple document environments. Recent work has shown that it is possible to combine both: when summarising a single document, its related documents can be found. These documents might share similar knowledge and contain information that is beneficial to the topic of the single document. Therefore, the summary produced will have sentences extracted from the local (single) document and make use of the additional knowledge from its surrounding (multi-) documents. This thesis discusses the methodology and experiments to build a generic and extractive summary for a single document that includes information from its neighbourhood documents. We also examine the evaluation and configuration of such systems. There are three contributions of our work. First, we explore the robustness of the Affinity Graph algorithm to generate a summary for a local document. This experiment focused on two main tasks: using different means to identify the related documents, and summarising the local document by including the information from the related documents. We showed that our findings supported the previous work on document summarization using the Affinity Graph. However, contrary to past suggestions that one configuration of settings was best, we found that no particular settings gave better improvements over others. Second, we applied the Affinity Graph algorithm in a social media environment. Recent work in social media suggests that information from blogs and tweets contains parts of the web document that are considered interesting to the user. We assumed that this information could be used to select important sentences from the web document, and hypothesized that the information would improve the summary of a single document. Third, we compare the summaries generated using the Affinity Graph algorithm in two types of evaluation. The first evaluation uses ROUGE, a commonly used evaluation tool that measures the number of overlapping words between automated summaries and human-generated summaries. In the second evaluation, we studied the judgement of human users using a crowdsourcing platform. Here, we asked people to choose between summaries and explain their reasons for preferring one summary to another. The results from the ROUGE evaluation did not give significant results due to the small tweet-document dataset used in our experiments. However, our findings from the human judgement evaluation showed that users are more likely to choose the summaries generated using the expanded tweets than summaries generated from the local documents only. We conclude the thesis with a study of the user comments and a discussion of the use of the Affinity Graph to improve single document summarization. We also discuss the lessons learnt from the user preference evaluation on a crowdsourcing platform.
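
    The core of an Affinity Graph style extractive summarizer can be sketched briefly: build a sentence-similarity graph over sentences from the local document and its neighbour documents, rank sentences with a PageRank-style score, and keep the top-scoring local sentences. The TF-IDF cosine similarity, the 0.1 edge threshold and the networkx/scikit-learn implementation below are illustrative assumptions, not the thesis's exact configuration.

        # Sketch of an affinity-graph extractive summary: sentences (local plus
        # neighbour documents) are nodes, cosine similarities above a threshold
        # are weighted edges, and a PageRank score ranks the sentences.
        import networkx as nx
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def affinity_summary(local_sentences, neighbour_sentences,
                             top_k=3, threshold=0.1):
            sentences = local_sentences + neighbour_sentences
            tfidf = TfidfVectorizer().fit_transform(sentences)
            sims = cosine_similarity(tfidf)

            graph = nx.Graph()
            graph.add_nodes_from(range(len(sentences)))
            for i in range(len(sentences)):
                for j in range(i + 1, len(sentences)):
                    if sims[i, j] > threshold:
                        graph.add_edge(i, j, weight=float(sims[i, j]))

            scores = nx.pagerank(graph, weight="weight")
            # Only sentences of the local document are eligible for the summary.
            local_ids = range(len(local_sentences))
            ranked = sorted(local_ids, key=lambda i: scores[i], reverse=True)[:top_k]
            return [local_sentences[i] for i in sorted(ranked)]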

    Evaluation with uncertainty

    Experimental uncertainty arises as a consequence of: (1) bias (systematic error), and (2) variance in measurements. Popular evaluation techniques only account for the variance due to sampling of experimental units, and assume the other sources of uncertainty can be ignored. For example, only the uncertainty due to sampling of topics (queries) and sampling of training/test datasets is considered in standard information retrieval (IR) and classifier system evaluation, respectively. However, incomplete relevance judgements, assessor disagreement, non-deterministic systems, and measurement bias can also cause uncertainty in these experiments. In this thesis, the impact of other sources of uncertainty on evaluating IR and classification experiments is investigated. The uncertainty due to: (1) incomplete relevance judgements in IR test collections, (2) non-determinism in IR systems / classifiers, and (3) high variance of classifiers is analysed using case studies from distributed information retrieval and information security. The thesis illustrates the importance of reducing, and accurately accounting for, uncertainty when evaluating complex IR and classifier systems. Novel techniques are introduced to (1) reduce uncertainty due to test collection bias in IR evaluation and high classifier variance (overfitting) in detecting drive-by download attacks, (2) account for multidimensional variance due to sampling of IR system instances from non-deterministic IR systems in addition to sampling of topics, and (3) account for repeated measurements due to non-deterministic classification algorithms.
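
    One way to account for variance from non-deterministic systems as well as topic sampling is a bootstrap that resamples both dimensions; the sketch below does this for a matrix of per-instance, per-topic scores. It is a generic illustration of the idea, not the specific techniques developed in the thesis.

        # Sketch: bootstrap that resamples both topics and system instances, so
        # the interval reflects system non-determinism as well as topic sampling.
        # scores[i, t] = effectiveness of instance i on topic t; the percentile
        # interval and number of resamples are illustrative choices.
        import numpy as np

        def two_way_bootstrap_ci(scores, n_boot=2000, confidence=0.95, seed=0):
            rng = np.random.default_rng(seed)
            n_instances, n_topics = scores.shape
            means = np.empty(n_boot)
            for b in range(n_boot):
                inst = rng.integers(0, n_instances, size=n_instances)
                tops = rng.integers(0, n_topics, size=n_topics)
                means[b] = scores[np.ix_(inst, tops)].mean()
            lo, hi = np.quantile(means, [(1 - confidence) / 2, (1 + confidence) / 2])
            return scores.mean(), (lo, hi)

        # Example: 5 instances of a non-deterministic system over 20 topics.
        rng = np.random.default_rng(1)
        scores = 0.3 + 0.05 * rng.standard_normal((5, 20))
        print(two_way_bootstrap_ci(scores))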

    Identification of re-finding tasks and search difficulty

    We address the problem of identifying whether users are attempting to re-find information and estimating the level of difficulty of the re-finding task. Identifying re-finding tasks and detecting search difficulties will enable search engines to respond dynamically to the search task being undertaken. To this aim, we conduct user studies and query log analysis to gain a better understanding of re-finding tasks and search difficulties. Using features gathered specifically in our user studies, we generate training sets from query log data, which are used for constructing automatic identification (prediction) models. Using machine learning techniques, our re-finding identification model, which is the first model at the task level, significantly outperforms existing query-based identification. While past research assumes that the previous search history of the user is available to the prediction model, we examine whether re-finding detection is possible without access to this information. Our evaluation indicates that such detection is possible, but more challenging. We further describe the first predictive model for detecting re-finding difficulty, showing it to be significantly better than existing approaches for detecting general search difficulty. We also analyze the features that are important for identifying both re-finding and its difficulties. Next, we investigate detailed identification of re-finding tasks and difficulties in terms of the type of the vertical document to be re-found. The accuracy of the constructed predictive models indicates that re-finding tasks are indeed distinguishable across verticals and in comparison to general search tasks. This illustrates the need to adapt existing general search techniques to the re-finding context by presenting vertical-specific results. Despite the overall reduction in accuracy for predictions made independently of the user's original search, identifying "image" re-finding appears to be less dependent on such past information. Investigating the real-time prediction effectiveness of the models shows that predicting "image" document re-finding obtains the highest accuracy early in the search. Early predictions would allow search engines to adapt search results during re-finding activities. Furthermore, we study the difficulties in re-finding across verticals given some of the established indications of difficulty in the general web search context. In terms of user effort, re-finding in the "image" vertical appears to take more effort, in terms of the number of queries and clicks, than the other investigated verticals, while re-finding "reference" documents seems to be more time consuming when there is a longer time gap between the re-finding and the corresponding original search. Exploring other features suggests that there could be difficulty indications particular to the re-finding context and specific to each vertical. To sum up, this research investigates the issue of effectively supporting users with re-finding search tasks. To this end, we have identified features that allow for more accurate distinction between re-finding and general tasks. This will enable search engines to better adapt search results to the re-finding context and improve the search experience of users. Moreover, features indicative of similar/different and easy/difficult re-finding tasks can be employed for building balanced test environments, which could address one of the main gaps in the re-finding context.
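
    At a high level, the identification models amount to supervised classifiers trained on task-level features. The sketch below shows the shape of such a pipeline with scikit-learn; the feature names, the random-forest model and the toy data are hypothetical placeholders, not the thesis's actual features or models.

        # Sketch: training a task-level re-finding identifier from query-log
        # features. Feature names and the model choice are hypothetical.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        FEATURES = ["query_overlap_with_history", "clicked_before",
                    "session_length", "dwell_time", "query_reformulation_count"]

        def evaluate_refinding_model(X, y):
            """5-fold cross-validated accuracy of a re-finding vs. general-search
            classifier; X is an (n_tasks, n_features) array, y the task labels."""
            model = RandomForestClassifier(n_estimators=200, random_state=0)
            return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

        # Toy example with random data, purely to show the shape of the API.
        rng = np.random.default_rng(0)
        X = rng.random((200, len(FEATURES)))
        y = rng.integers(0, 2, size=200)
        print(evaluate_refinding_model(X, y))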

    Efficient and effective retrieval using Higher-Order proximity models

    Information Retrieval systems are widely used to retrieve documents that are relevant to a user's information need. Systems leveraging proximity heuristics to estimate the relevance of a document have been shown to be effective. However, the computational cost of proximity-based models is rarely considered, which is an important concern over large-scale document collections. Large-scale collections also make collection-based evaluation challenging, since only a small number of documents can be judged given the limited budget. Effectiveness, efficiency and reliable evaluation are coherent components that should be considered when developing a good retrieval system. This thesis makes several contributions across these three aspects. Many proximity-based retrieval models are effective, but it is also important to find efficient ways to extract proximity features, especially for models using higher-order proximity statistics. We therefore propose a one-pass algorithm based on the PlaneSweep approach. We demonstrate that the new one-pass algorithm reduces the cost of capturing a full dependency relation of a query, regardless of the input representations. Although our proposed methods can capture higher-order proximity features efficiently, the trade-offs between effectiveness and efficiency when using proximity-based models remain largely unexplored. We consider different variants of proximity statistics and demonstrate that using local proximity statistics can achieve an improved trade-off between effectiveness and efficiency. Another important aspect of IR is reliable system comparison. We conduct a series of experiments that explore the interaction between pooling depth and evaluation depth, the interaction between evaluation metrics and evaluation depth, and the correlation between different evaluation metrics. We show that different evaluation configurations on large test collections, where only a limited number of relevance labels are available, can lead to different system comparison conclusions. We also demonstrate the pitfalls of choosing an arbitrary evaluation depth regardless of the metrics employed and the pooling depth of the test collections. Lastly, we provide suggestions on evaluation configurations for reliable comparisons of retrieval systems on large test collections. On these large test collections, a shallow judgment pool may be employed because assessment budgets are often limited, which may lead to an imprecise evaluation of system performance, especially when a deep evaluation metric is used. We propose a framework for estimating deep metric scores from shallow judgment pools. Starting from an initial shallow judgment pool, rank-level estimators are designed to estimate the effectiveness gain at each rank. Based on the rank-level estimations, we propose an optimization framework to obtain a more precise score estimate.
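
    As a flavor of the proximity computations involved, the sketch below uses a simple plane sweep over per-term position lists to find the smallest window covering every query term, a common proximity statistic. It is a generic illustration, not the thesis's one-pass higher-order algorithm.

        # Sketch: plane sweep over per-term position lists to find the smallest
        # span containing all query terms at least once (a proximity feature).
        def min_cover_span(positions):
            """positions: {term: sorted list of positions in the document}.
            Returns the smallest (start, end) window covering every term."""
            events = sorted((p, t) for t, ps in positions.items() for p in ps)
            last_seen = {}
            best = None
            for pos, term in events:
                last_seen[term] = pos
                if len(last_seen) == len(positions):
                    start = min(last_seen.values())
                    if best is None or pos - start < best[1] - best[0]:
                        best = (start, pos)
            return best

        print(min_cover_span({"information": [3, 17], "retrieval": [4, 9, 21]}))  # (3, 4)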