
    Precision-Recall Curves Using Information Divergence Frontiers

    Despite the tremendous progress in the estimation of generative models, the development of tools for diagnosing their failures and assessing their performance has advanced at a much slower pace. Recent work has investigated metrics that quantify which parts of the true distribution are modeled well and, conversely, what the model fails to capture, akin to precision and recall in information retrieval. In this paper, we present a general evaluation framework for generative models that measures the trade-off between precision and recall using R\'enyi divergences. Our framework provides a novel perspective on existing techniques and extends them to more general domains. As a key advantage, this formulation encompasses both continuous and discrete models and allows for the design of efficient algorithms that do not have to quantize the data. We further analyze the biases of the approximations used in practice. Comment: Updated to the AISTATS 2020 version
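    The Rényi divergence underlying this framework has a simple closed form for discrete distributions: D_alpha(P||Q) = log(sum_i p_i^alpha * q_i^(1-alpha)) / (alpha - 1). A minimal Python sketch, assuming small categorical distributions and sweeping mixtures of P and Q to trace an illustrative frontier (this is not the authors' algorithm):

        import numpy as np

        def renyi_divergence(p, q, alpha):
            """Renyi divergence D_alpha(P || Q) for discrete distributions.

            The alpha -> 1 limit recovers the Kullback-Leibler divergence.
            """
            p, q = np.asarray(p, float), np.asarray(q, float)
            if np.isclose(alpha, 1.0):  # KL limit
                mask = p > 0
                return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
            return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

        # Illustrative frontier: evaluate mixtures R between the model Q and
        # the target P against both endpoints (hypothetical toy distributions).
        p = np.array([0.5, 0.3, 0.2])   # "true" distribution
        q = np.array([0.2, 0.3, 0.5])   # model distribution
        for lam in np.linspace(0.1, 0.9, 5):
            r = lam * p + (1 - lam) * q
            print(lam, renyi_divergence(r, p, 2.0), renyi_divergence(r, q, 2.0))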

    Search under uncertainty: Cognitive biases and heuristics: A tutorial on testing, mitigating and accounting for cognitive biases in search experiments

    Understanding how people interact with search interfaces is core to the field of Interactive Information Retrieval (IIR). While various models have been proposed (e.g., Belkin's ASK, berrypicking, everyday-life information seeking, information foraging theory, economic theory, etc.), they have largely ignored the impact of cognitive biases on search behaviour and performance. A growing body of empirical work exploring how people's cognitive biases influence search and judgments has led to the development of new models of search that draw upon behavioural economics and psychology. This full-day tutorial will provide a starting point for researchers seeking to learn more about information seeking, search and retrieval under uncertainty. The tutorial will be structured into three parts. First, we will provide an introduction to the heuristics and biases program put forward by Tversky and Kahneman [60] (1974), which assumes that people are not always rational. The second part of the tutorial will provide an overview of the types and space of biases in search [5, 40], before doing a deep dive into several specific examples and the impact of biases on different types of decisions (e.g., health/medical, financial). The third part will focus on a discussion of the practical implications for the design and evaluation of human-centered IR systems in light of cognitive biases, where participants will undertake some hands-on exercises.

    Macro-Average: Rare Types Are Important Too

    While traditional corpus-level evaluation metrics for machine translation (MT) correlate well with fluency, they struggle to reflect adequacy. Model-based MT metrics trained on segment-level human judgments have emerged as an attractive replacement due to strong correlation results. These models, however, require potentially expensive re-training for new domains and languages. Furthermore, their decisions are inherently non-transparent and appear to reflect unwelcome biases. We explore the simple type-based classifier metric MacroF1 and study its applicability to MT evaluation. We find that MacroF1 is competitive on direct assessment and outperforms other metrics in indicating downstream cross-lingual information retrieval task performance. Further, we show that MacroF1 can be used to effectively compare supervised and unsupervised neural machine translation, and reveal significant qualitative differences in the methods' outputs.
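    As a rough illustration of what a type-based MacroF1 can look like (a sketch of one plausible formulation, not necessarily the paper's exact definition): score each vocabulary type independently with clipped-count precision and recall, then average the per-type F1 scores so that rare types weigh as much as frequent ones.

        from collections import Counter

        def macro_f1(hypothesis_tokens, reference_tokens):
            """Macro-averaged F1 over vocabulary types (illustrative)."""
            hyp, ref = Counter(hypothesis_tokens), Counter(reference_tokens)
            scores = []
            for t in set(hyp) | set(ref):
                match = min(hyp[t], ref[t])                    # clipped matches
                prec = match / hyp[t] if hyp[t] else 0.0
                rec = match / ref[t] if ref[t] else 0.0
                scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
            return sum(scores) / len(scores) if scores else 0.0

        print(macro_f1("the cat sat on the mat".split(),
                       "the cat is on the mat".split()))  # ~0.67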

    Global-Liar: Factuality of LLMs over Time and Geographic Regions

    The increasing reliance on AI-driven solutions, particularly Large Language Models (LLMs) like the GPT series, for information retrieval highlights the critical need for their factuality and fairness, especially amidst the rampant spread of misinformation and disinformation online. Our study evaluates the factual accuracy, stability, and biases of widely adopted GPT models, including GPT-3.5 and GPT-4, contributing to the reliability and integrity of AI-mediated information dissemination. We introduce 'Global-Liar,' a dataset uniquely balanced in terms of geographic and temporal representation, facilitating a more nuanced evaluation of LLM biases. Our analysis reveals that newer iterations of GPT models do not always equate to improved performance. Notably, the GPT-4 version from March demonstrates higher factual accuracy than its subsequent June release. Furthermore, a concerning bias is observed, privileging statements from the Global North over the Global South, thus potentially exacerbating existing informational inequities. Regions such as Africa and the Middle East are at a disadvantage, with much lower factual accuracy. The performance fluctuations over time suggest that model updates may not benefit all regions equally. Our study also offers insights into the impact of various LLM configuration settings, such as binary decision forcing, model re-runs and temperature, on a model's factuality. Models constrained to binary (true/false) choices exhibit reduced factuality compared to those allowing an 'unclear' option. A single inference at a low temperature setting matches the reliability of majority voting across various configurations. The insights gained highlight the need for culturally diverse and geographically inclusive model training and evaluation. This approach is key to achieving global equity in technology and distributing AI benefits fairly worldwide. Comment: 24 pages, 12 figures, 9 tables
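    The majority-voting and 'unclear'-option findings suggest a simple aggregation pattern. A hypothetical helper (the names and tie-breaking rule are assumptions, not the paper's code) that majority-votes repeated model verdicts and falls back to "unclear" on ties:

        from collections import Counter

        def aggregate_verdicts(verdicts):
            """Majority vote over repeated true/false/unclear verdicts."""
            counts = Counter(verdicts)
            (top, n), *rest = counts.most_common()
            if rest and rest[0][1] == n:   # tie between leading labels
                return "unclear"
            return top

        print(aggregate_verdicts(["true", "true", "unclear", "true", "false"]))  # true
        print(aggregate_verdicts(["true", "false"]))                             # unclear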

    Do Neural Ranking Models Intensify Gender Bias?

    Concerns regarding the footprint of societal biases in information retrieval (IR) systems have been raised in several previous studies. In this work, we examine various recent IR models from the perspective of the degree of gender bias in their retrieval results. To this end, we first provide a bias measurement framework comprising two metrics that quantify the degree of unbalanced presence of gender-related concepts in a given IR model's ranking list. To examine IR models by means of this framework, we create a dataset of non-gendered queries, selected by human annotators. Applying these queries to the MS MARCO Passage retrieval collection, we then measure the gender bias of a BM25 model and several recent neural ranking models. The results show that while all models are strongly biased toward males, the neural models, and in particular the ones based on contextualized embeddings, significantly intensify gender bias. Our experiments also show an overall increase in the gender bias of neural models when they exploit transfer learning, namely when they use (already biased) pre-trained embeddings. Comment: In Proceedings of ACM SIGIR 2020
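    One simple way to quantify the kind of imbalance such a framework measures (a toy sketch; the word lists and the normalization are assumptions, not the paper's two metrics) is a signed gap between male- and female-term frequencies in the top-k results:

        # Hypothetical word lists; the paper defines its own metrics.
        MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
        FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "female"}

        def ranking_gender_bias(ranked_docs, k=10):
            """Signed male-vs-female term-frequency gap over the top-k docs."""
            male = female = total = 0
            for doc in ranked_docs[:k]:
                tokens = doc.lower().split()
                male += sum(t in MALE_TERMS for t in tokens)
                female += sum(t in FEMALE_TERMS for t in tokens)
                total += len(tokens)
            return (male - female) / total if total else 0.0

        docs = ["he said his results were strong", "she presented her findings"]
        print(ranking_gender_bias(docs, k=2))  # 0.0 here; positive = male skew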

    Optimizing Ranking Models in an Online Setting

    Online Learning to Rank (OLTR) methods optimize ranking models by directly interacting with users, which allows them to be very efficient and responsive. All OLTR methods introduced during the past decade have extended the original OLTR method: Dueling Bandit Gradient Descent (DBGD). Recently, a fundamentally different approach was introduced with the Pairwise Differentiable Gradient Descent (PDGD) algorithm. To date, the only comparisons of the two approaches have been limited to simulations with cascading click models and low levels of noise. The main outcome so far is that PDGD converges at higher levels of performance and learns considerably faster than DBGD-based methods. However, the PDGD algorithm assumes cascading user behavior, potentially giving it an unfair advantage. Furthermore, the robustness of both methods to high levels of noise has not been investigated. Therefore, it is unclear whether the reported advantages of PDGD over DBGD generalize to different experimental conditions. In this paper, we investigate whether the previous conclusions about the PDGD and DBGD comparison generalize from ideal to worst-case circumstances. We do so in two ways. First, we compare the theoretical properties of PDGD and DBGD by taking a critical look at previously proven properties in the context of ranking. Second, we estimate upper and lower bounds on the performance of both methods by simulating both ideal user behavior and extremely difficult behavior, i.e., almost-random non-cascading user models. Our findings show that the theoretical bounds of DBGD do not apply to any common ranking model and, furthermore, that the performance of DBGD is substantially worse than that of PDGD in both ideal and worst-case circumstances. These results reproduce previously published findings about the relative performance of PDGD vs. DBGD and generalize them to extremely noisy and non-cascading circumstances. Comment: European Conference on Information Retrieval (ECIR) 2019
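    For orientation, the DBGD update the paper analyzes can be summarized in a few lines. A schematic Python sketch (the duel oracle stands in for an interleaved click comparison; delta and eta are the usual exploration and learning-rate parameters):

        import numpy as np

        def dbgd_step(w, duel, delta=1.0, eta=0.1, rng=np.random.default_rng()):
            """One Dueling Bandit Gradient Descent update (schematic).

            duel(w, w_prime) returns True when the perturbed ranker wins
            the interleaved comparison, e.g. receives more clicks.
            """
            u = rng.normal(size=w.shape)
            u /= np.linalg.norm(u)        # random unit exploration direction
            w_prime = w + delta * u       # candidate ranker
            return w + eta * u if duel(w, w_prime) else w

        # Toy duel: prefer weights closer to a hidden optimum (a stand-in
        # for user clicks; purely illustrative).
        optimum = np.array([1.0, -2.0, 0.5])
        duel = lambda w, wp: np.linalg.norm(wp - optimum) < np.linalg.norm(w - optimum)
        w = np.zeros(3)
        for _ in range(200):
            w = dbgd_step(w, duel)
        print(w)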

    Process-evaluation of tropospheric humidity simulated by general circulation models using water vapor isotopologues: 1. Comparison between models and observations

    The goal of this study is to determine how H_2O and HDO measurements in water vapor can be used to detect and diagnose biases in the representation of processes controlling tropospheric humidity in atmospheric general circulation models (GCMs). We analyze a large number of isotopic data sets (four satellite, sixteen ground-based remote-sensing, five surface in situ and three aircraft data sets) that are sensitive to different altitudes throughout the free troposphere. Despite significant differences between data sets, we identify some observed HDO/H_2O characteristics that are robust across data sets and that can be used to evaluate models. We evaluate the isotopic GCM LMDZ, accounting for the effects of spatiotemporal sampling and instrument sensitivity. We find that LMDZ reproduces the spatial patterns in the lower and mid troposphere remarkably well. However, it underestimates the amplitude of seasonal variations in isotopic composition at all levels in the subtropics and in midlatitudes, and this bias is consistent across all data sets. LMDZ also underestimates the observed meridional isotopic gradient and the contrast between dry and convective tropical regions compared to satellite data sets. Comparison with six other isotope-enabled GCMs from the SWING2 project shows that the biases exhibited by LMDZ are common to all models. The SWING2 GCMs show a very large spread in isotopic behavior that is not obviously related to that of humidity, suggesting that water vapor isotopic measurements could be used to expose model shortcomings. In a companion paper, the isotopic differences between models are interpreted in terms of biases in the representation of processes controlling humidity.
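    For readers unfamiliar with the notation, HDO/H_2O measurements are conventionally reported as δD in permil relative to the VSMOW standard. A minimal sketch of that conversion (the sample value below is illustrative only):

        # delta-D in permil; R_VSMOW is the standard D/H isotope ratio.
        R_VSMOW = 155.76e-6

        def delta_d(r_sample):
            """Convert a D/H ratio into delta-D notation (permil vs. VSMOW)."""
            return (r_sample / R_VSMOW - 1.0) * 1000.0

        # Free-tropospheric water vapor is strongly depleted in deuterium.
        print(delta_d(125e-6))  # about -197 permil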