    A Preference Judgment Interface for Authoritative Assessment

    For offline evaluation of information retrieval systems, preference judgments have been demonstrated to be a superior alternative to graded or binary relevance judgments. In contrast to graded judgments, where each document is assigned to a pre-defined grade level, with preference judgments assessors judge a pair of items presented side by side, indicating which is better. Unfortunately, preference judgments may require a larger number of judgments, even under an assumption of transitivity, and until recently they also lacked well-established evaluation measures. Previous studies have explored various evaluation measures and proposed different approaches to address the perceived shortcomings of preference judgments. These studies focused on crowdsourced preference judgments, where assessors may lack the training and time to make careful judgments; they did not consider the case where assessors have been trained and given the time to carefully consider differences between items. We review the literature on algorithms and strategies for extracting preference judgments, evaluation metrics, interface design, and the use of crowdsourcing. In this thesis, we design and build a new framework for preference judgment called JUDGO, with various components designed for expert reviewers and researchers. We also propose a new heap-like preference judgment algorithm that assumes transitivity and tolerates ties. With the help of our framework, NIST assessors found the top-10 best items for each of the 38 topics of the TREC 2022 Health Misinformation Track, with more than 2,200 judgments collected. Our analysis shows that assessors frequently use the search-box feature, which enables them to highlight their own keywords in documents, but they are less interested in highlighting documents with the mouse. Based on this additional feedback, we made some modifications to the initially proposed algorithm and highlighting features.
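    The abstract does not spell the algorithm out, so the following Python sketch only illustrates the general idea of driving a heap-based top-k selection with pairwise human judgments, where a tie counts as "not preferred either way"; ask, oracle, Judged, and top_k are hypothetical names, not JUDGO's actual interface.

```python
import heapq

def make_cached_oracle(ask):
    """Wrap the assessor so each pair is judged at most once."""
    cache = {}
    def oracle(a, b):
        if (a, b) not in cache:
            # ask(a, b) stands in for the human judge: it returns 1 if a is
            # preferred, -1 if b is preferred, and 0 for a tie.
            result = ask(a, b)
            cache[(a, b)], cache[(b, a)] = result, -result
        return cache[(a, b)]
    return oracle

class Judged:
    """Routes heapq's comparisons to the (human) preference oracle."""
    def __init__(self, item, oracle):
        self.item, self.oracle = item, oracle
    def __lt__(self, other):
        # heapq keeps the least-preferred item at the root; a tie (0)
        # counts as "not less", so tied items are tolerated as equals.
        return self.oracle(self.item, other.item) < 0

def top_k(items, k, oracle):
    """Heap-based top-k selection driven by pairwise judgments."""
    heap = [Judged(it, oracle) for it in items[:k]]
    heapq.heapify(heap)
    for it in items[k:]:
        cand = Judged(it, oracle)
        if heap[0] < cand:               # one judgment per candidate
            heapq.heapreplace(heap, cand)
    ranked = [heapq.heappop(heap).item for _ in range(len(heap))]
    return ranked[::-1]                  # best first
```

    Caching every answered pair means no pair is ever shown to the assessor twice, and under the transitivity assumption the cached answers stay mutually consistent.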

    Information Retrieval Evaluation Measures Based on Preference Graphs

    Offline evaluation for web search has mostly used graded judgments to evaluate the performance of information retrieval systems. While graded judgments suffer from several known problems, preference judgments simply judge one item over another, which avoids the need for a complex definition of relevance scores. Previous research on evaluation measures for preference judgments focuses either on translating preferences into relevance scores for use in traditional evaluation measures, or on weighting and counting the number of agreements between the actual ranking from users' preferences and an ideal ranking generated by systems. However, these measures lack clear theoretical foundations and their values have no obvious interpretation. Moreover, although preference judgments for general web search have been studied extensively, there is limited research investigating the application of preference judgments to web image search. This thesis addresses exactly these questions, proposing a preference-based evaluation measure that computes the maximum similarity between an actual ranking from users' preferences and an ideal ranking generated by systems. Specifically, the measure constructs a directed multigraph and computes the ordering of vertices, which we call the ideal ranking, that has maximum similarity to the actual ranking as calculated by a rank similarity measure. The measure can take any arbitrary collection of preferences, which may include conflicts, redundancies, incompleteness, and results of diverse types (documents or images). Our results show that Greedy PGC matches or exceeds the performance of evaluation measures proposed in previous research.
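    The abstract leaves the details of Greedy PGC open; the sketch below shows one standard greedy way to order the vertices of a preference multigraph by net wins, tolerating conflicting and duplicate preference pairs. All names are illustrative, not the thesis's actual procedure.

```python
from collections import defaultdict

def greedy_order(preferences):
    """Greedily order items to agree with as many preferences as possible.

    `preferences` is a list of (winner, loser) pairs and may contain
    duplicates (a multigraph), conflicts, and items that appear only once.
    """
    wins, losses = defaultdict(int), defaultdict(int)
    items = set()
    for w, l in preferences:
        wins[w] += 1
        losses[l] += 1
        items.update((w, l))

    order = []
    remaining = set(items)
    while remaining:
        # Emit the remaining item with the highest net preference score.
        best = max(remaining, key=lambda v: wins[v] - losses[v])
        order.append(best)
        remaining.remove(best)
        # Discount edges incident to the emitted item so later picks
        # reflect only preferences among the items still remaining.
        for w, l in preferences:
            if w == best and l in remaining:
                losses[l] -= 1
            elif l == best and w in remaining:
                wins[w] -= 1
    return order
```

    A rank similarity measure (e.g. a weighted agreement count between this ordering and the system ranking) can then score systems against the emitted ideal ranking.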

    Discourse Structure in Machine Translation Evaluation

    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics with respect to correlation with human judgments at both the segment and the system level. This suggests that discourse information is complementary to the information used by many existing evaluation metrics, and could therefore be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular, we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference tree is positively correlated with translation quality.
    Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201
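    As a rough illustration of the approach, here is a simplified convolution tree kernel over (label, children) tuples plus the linear combination step. It is a sketch in the spirit of the article's all-subtree kernels, not its exact formulation, and alpha is an assumed mixing weight.

```python
def subtree_kernel(t1, t2, decay=0.5):
    """Count (decayed) common subtrees of two parse trees.

    Trees are (label, [children]) tuples; leaves have empty child lists.
    """
    def nodes(t):
        yield t
        for c in t[1]:
            yield from nodes(c)

    def common(n1, n2):
        # Subtrees can only match if the root labels and the child label
        # sequences (the "production") agree at this node.
        if n1[0] != n2[0] or [c[0] for c in n1[1]] != [c[0] for c in n2[1]]:
            return 0.0
        score = decay
        for c1, c2 in zip(n1[1], n2[1]):
            score *= 1.0 + common(c1, c2)
        return score

    return sum(common(a, b) for a in nodes(t1) for b in nodes(t2))

def normalized_similarity(t1, t2):
    # Normalize so that identical trees score 1.0.
    k12 = subtree_kernel(t1, t2)
    return k12 / (subtree_kernel(t1, t1) * subtree_kernel(t2, t2)) ** 0.5

def combined_metric(base_score, rst_tree_hyp, rst_tree_ref, alpha=0.8):
    # The simple linear combination the article describes: mix an existing
    # metric score (e.g. BLEU) with the discourse-tree similarity; alpha
    # is a hypothetical weight to be tuned on held-out human judgments.
    return alpha * base_score + (1 - alpha) * normalized_similarity(
        rst_tree_hyp, rst_tree_ref)
```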

    A Study of Realtime Summarization Metrics

    Unexpected news events, such as natural disasters or other human tragedies, create a large volume of dynamic text data from official news media as well as less formal social media. Automatic real-time text summarization has become an important tool for quickly transforming this overabundance of text into clear, useful information for end users, including affected individuals, crisis responders, and interested third parties. Despite the importance of real-time summarization systems, their evaluation is not well understood, as classic methods for text summarization evaluation are inappropriate for real-time and streaming conditions. The TREC 2013-2015 Temporal Summarization (TREC-TS) track was one of the first evaluation campaigns to tackle the challenges of real-time summarization evaluation, introducing new metrics, a ground-truth generation methodology, and a dataset. In this paper, we present a study of the TREC-TS track's evaluation methodology, with the aim of documenting its design, analyzing its effectiveness, and identifying improvements and best practices for the evaluation of temporal summarization systems.
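    To make the evaluation setting concrete, here is a hedged sketch of a latency-discounted gain of the kind the TREC-TS track introduced: an update earns credit for each novel nugget it matches, discounted by how long the system waited past the moment the nugget first became available. The linear 24-hour discount and the data layout are illustrative assumptions, not the track's official definition.

```python
def latency_discount(update_time, nugget_time, horizon=24 * 3600):
    """Linearly discount gain for delay past the nugget's availability.

    The 24-hour horizon is an illustrative assumption, not the official
    TREC-TS parameter.
    """
    delay = max(0.0, update_time - nugget_time)
    return max(0.0, 1.0 - delay / horizon)

def expected_latency_gain(updates, nuggets):
    """Sketch of a latency-discounted expected gain per pushed update.

    `updates`: list of (timestamp, matched_nugget_ids) for each update.
    `nuggets`: dict nugget_id -> (first_available_time, importance).
    Each nugget is credited at most once, at its earliest match.
    """
    credited = set()
    gain = 0.0
    for ts, matched in sorted(updates, key=lambda u: u[0]):
        for nid in matched:
            if nid in credited or nid not in nuggets:
                continue
            t_n, importance = nuggets[nid]
            gain += importance * latency_discount(ts, t_n)
            credited.add(nid)
    # Normalizing by the number of updates penalizes verbose systems.
    return gain / len(updates) if updates else 0.0
```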

    Query Chains: Learning to Rank from Implicit Feedback

    This paper presents a novel approach for using clickthrough data to learn ranked retrieval functions for web search results. We observe that users searching the web often perform a sequence, or chain, of queries with a similar information need. Using query chains, we generate new types of preference judgments from search engine logs, thus taking advantage of user intelligence in reformulating queries. To validate our method, we perform a controlled user study comparing generated preference judgments to explicit relevance judgments. We also implement a real-world search engine to test our approach, using a modified ranking SVM to learn an improved ranking function from preference data. Our results demonstrate significant improvements in the rankings produced by the search engine. The learned rankings outperform both a static ranking function and one trained without considering query chains.
    Comment: 10 pages
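    The paper's modified ranking SVM is not reproduced in the abstract; the following minimal sketch shows how a linear ranking function can be learned from such preference pairs with a pairwise hinge loss, which is the standard ranking-SVM objective. The names and hyperparameters are illustrative.

```python
import numpy as np

def train_ranking_svm(pairs, dim, epochs=100, lr=0.01, reg=0.01):
    """Fit a linear ranking function from preference pairs by subgradient
    descent on the pairwise hinge loss.

    `pairs` is a list of (x_preferred, x_other) feature-vector tuples,
    e.g. derived from query-chain clickthrough logs.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            diff = np.asarray(x_pos) - np.asarray(x_neg)
            # Hinge loss: penalize when the preferred document does not
            # outscore the other by at least a margin of 1.
            if w @ diff < 1.0:
                w += lr * (diff - reg * w)
            else:
                w -= lr * reg * w
    return w

# Usage: score each candidate document as w @ features, sort descending.
```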