
    Nuggeteer: Automatic Nugget-Based Evaluation Using Descriptions and Judgements

    TREC Definition and Relationship questions are evaluated on the basis of information nuggets that may be contained in system responses. Human evaluators provide informal descriptions of each nugget, and judgements (assignments of nuggets to responses) for each response submitted by participants. The best present automatic evaluation for these kinds of questions is Pourpre. Pourpre uses a stemmed unigram similarity of responses with nugget descriptions, yielding an aggregate result that is difficult to interpret, but is useful for relative comparison. Nuggeteer, by contrast, uses both the human descriptions and the human judgements, and makes binary decisions about each response, so that the end result is as interpretable as the official score. I explore n-gram length, use of judgements, stemming, and term weighting, and provide a new algorithm quantitatively comparable to, and qualitatively better than, the state of the art.
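
    As an illustration of the kind of matching Nuggeteer performs, the sketch below makes a binary contains/does-not-contain decision for a (response, nugget) pair from unigram overlap with the nugget description and any responses already judged to contain it. The crude suffix stemmer, the 0.5 threshold, and the toy data are assumptions for illustration rather than Nuggeteer's actual parameters.

```python
# Minimal sketch of a Nuggeteer-style binary judgement (illustrative only).
import re

def crude_stem(token: str) -> str:
    """Very rough stand-in for a stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def unigrams(text: str) -> set:
    return {crude_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())}

def contains_nugget(response: str, nugget_evidence: list, threshold: float = 0.5) -> bool:
    """Decide whether the response contains the nugget.

    nugget_evidence holds the assessor's informal description plus any
    responses already judged to contain the nugget, so the human judgements
    sharpen the match beyond the description alone.
    """
    resp = unigrams(response)
    best = 0.0
    for evidence in nugget_evidence:
        ev = unigrams(evidence)
        if ev:
            best = max(best, len(resp & ev) / len(ev))
    return best >= threshold

# Hypothetical nugget and responses.
nugget = ["founded the company in 1994 in Seattle"]
print(contains_nugget("The firm was founded in Seattle in 1994.", nugget))  # True
print(contains_nugget("Its headquarters moved twice.", nugget))             # False
```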

    An evaluation framework based on gold standard models for definition question answering

    This paper presents a weakly supervised evaluation framework for definition question answering (DefQA) called Solon. It automatically evaluates a set of DefQA systems using existing human definitions as gold standard models. In this way it is able to overcome known limitations of the evaluation methods in the state of the art. In addition, Solon assumes that each DefQA task may require a different evaluation configuration, and it is able to automatically find the best one. The results obtained in our experiments show that Solon performs well with respect to the evaluation methods in the state of the art, with the advantage that it is less supervised.
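
    A minimal sketch of the underlying idea follows: a system definition is scored by its best token-overlap F-measure against existing human definitions of the same target. The tokenisation, the F-measure, and the example definitions are assumptions for illustration; Solon's per-task search for the best evaluation configuration is omitted here.

```python
# Score a system definition against gold-standard human definitions (sketch).
import re

def tokens(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())

def best_f(system_def: str, gold_defs: list, beta: float = 1.0) -> float:
    """Best precision/recall F of the system definition against any gold model."""
    sys_toks = set(tokens(system_def))
    best = 0.0
    for gold in gold_defs:
        gold_toks = set(tokens(gold))
        if not sys_toks or not gold_toks:
            continue
        overlap = len(sys_toks & gold_toks)
        p, r = overlap / len(sys_toks), overlap / len(gold_toks)
        if p + r:
            best = max(best, (1 + beta**2) * p * r / (beta**2 * p + r))
    return best

# Hypothetical gold definitions and one system response.
gold = ["Aspirin is a drug used to reduce pain, fever, and inflammation.",
        "A common analgesic and anti-inflammatory medication."]
print(best_f("A medication used to treat pain and inflammation.", gold))  # best F over the gold models
```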

    Assessor Differences and User Preferences in Tweet Timeline Generation

    In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, are the differences statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions about assessor differences and user preferences in the context of the newly-introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system’s goal is to construct an informative summary of non-redundant tweets that addresses the user’s information need. Central to the evaluation methodology are human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics even though users are not explicitly aware of the semantic clustering being performed by the systems. Although our analyses are limited to this particular task, we believe that lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
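
    The sketch below illustrates cluster-based scoring of a tweet timeline in this spirit: recall credits each semantic cluster at most once, while tweets from an already-covered cluster (or from no cluster at all) drag precision down. The unweighted measures and the toy cluster data are assumptions for illustration, not the track's official metrics.

```python
# Cluster-based precision/recall for a tweet timeline (illustrative only).
def cluster_scores(timeline_tweet_ids, clusters):
    """clusters: cluster id -> set of tweet ids judged to carry that information."""
    tweet_to_cluster = {t: c for c, tweets in clusters.items() for t in tweets}
    covered = {tweet_to_cluster[t] for t in timeline_tweet_ids if t in tweet_to_cluster}
    recall = len(covered) / len(clusters) if clusters else 0.0
    precision = len(covered) / len(timeline_tweet_ids) if timeline_tweet_ids else 0.0
    return precision, recall

# Hypothetical clusters and a four-tweet timeline (one redundant, one off-topic tweet).
clusters = {"c1": {101, 102}, "c2": {103}, "c3": {104, 105}}
print(cluster_scores([101, 102, 103, 999], clusters))  # (0.5, 0.666...)
```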

    Answering clinical questions with knowledge-based and statistical techniques

    The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine. This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine. We have developed a series of knowledge extractors, which employ a combination of knowledge-based and statistical techniques, for automatically identifying clinically relevant aspects of MEDLINE abstracts. These extracted elements serve as the input to an algorithm that scores the relevance of citations with respect to structured representations of information needs, in accordance with the principles of evidence-based medicine. Starting with an initial list of citations retrieved by PubMed, our system can bring relevant abstracts into higher ranking positions, and from these abstracts generate responses that directly answer physicians’ questions. We describe three separate evaluations: one focused on the accuracy of the knowledge extractors, one conceptualized as a document reranking task, and finally, an evaluation of answers by two physicians. Experiments on a collection of real-world clinical questions show that our approach significantly outperforms the already competitive PubMed baseline.
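
    As a rough illustration of scoring citations against a structured information need, the sketch below matches extracted elements to a PICO-style frame with weighted slots. The slot names, weights, matching rule, and example data are assumptions for illustration; the system's actual knowledge extractors and scoring algorithm are considerably more involved.

```python
# Illustrative reranking of citations against a structured information need.
SLOT_WEIGHTS = {"problem": 2.0, "intervention": 1.5, "outcome": 1.0}  # assumed weights

def citation_score(extracted: dict, need: dict) -> float:
    """Sum the weights of frame slots where extracted elements match the need."""
    score = 0.0
    for slot, weight in SLOT_WEIGHTS.items():
        want = {w.lower() for w in need.get(slot, [])}
        have = {w.lower() for w in extracted.get(slot, [])}
        if want and want & have:
            score += weight
    return score

# Hypothetical extracted elements for two citations and one information need.
need = {"problem": ["migraine"], "intervention": ["propranolol"], "outcome": ["frequency"]}
citations = [
    {"id": "A", "problem": ["migraine"], "intervention": ["propranolol"], "outcome": ["frequency"]},
    {"id": "B", "problem": ["migraine"], "intervention": ["acupuncture"], "outcome": []},
]
ranked = sorted(citations, key=lambda c: citation_score(c, need), reverse=True)
print([c["id"] for c in ranked])  # ['A', 'B']
```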

    On enhancing the robustness of timeline summarization test collections

    Timeline generation systems are a class of algorithms that produce a sequence of time-ordered sentences or text snippets extracted in real-time from high-volume streams of digital documents (e.g. news articles), focusing on retaining relevant and informative content for a particular information need (e.g. topic or event). These systems have a range of uses, such as producing concise overviews of events for end-users (human or artificial agents). To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed, focusing on information nugget or cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. sentences) to an explicit representation of what information a ‘good’ summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such evaluations fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which timeline summarization test collections fail to generalize to new summarization systems, then we propose, evaluate and analyze new automatic solutions to this issue. In particular, using a depooling methodology over 19 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when considering lower-effectiveness systems, the test collections are robust (the likelihood of systems being mis-ranked is low). However, we show that the risk of systems being mis-ranked increases as the effectiveness of systems held out from the pool increases. To reduce the risk of mis-ranking systems, we also propose a range of different automatic ground truth label expansion techniques. Our results show that the proposed expansion techniques can be effective at increasing the robustness of the TREC-TS test collections, as they are able to generate large numbers of missing matches with high accuracy, markedly reducing the number of mis-rankings by up to 50%.
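
    One family of fixes mentioned above is automatic ground-truth label expansion. The sketch below shows the general shape of such a technique: an unjudged candidate sentence inherits a ground-truth label when it is lexically close enough to sentences assessors have already matched to that item. The Jaccard similarity, the 0.6 threshold, and the toy data are assumptions for illustration, not the specific expansion techniques evaluated in the paper.

```python
# Propagate ground-truth labels to unjudged sentences by lexical similarity (sketch).
import re

def jaccard(a: str, b: str) -> float:
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def expand_label(candidate: str, labelled: dict, threshold: float = 0.6):
    """labelled: ground-truth item id -> sentences assessors already matched to it."""
    best_item, best_sim = None, 0.0
    for item_id, sentences in labelled.items():
        for s in sentences:
            sim = jaccard(candidate, s)
            if sim > best_sim:
                best_item, best_sim = item_id, sim
    return best_item if best_sim >= threshold else None

# Hypothetical labelled pool and an unjudged candidate sentence.
labelled = {"nugget-3": ["The earthquake damaged the main bridge in the city."]}
print(expand_label("The main bridge in the city was damaged by the earthquake.", labelled))  # nugget-3
```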

    Will Pyramids Built of nUggets Topple Over

    The present methodology for evaluating complex questions at TREC analyzes answers in terms of facts called “nuggets”. The official F-score metric represents the harmonic mean between recall and precision at the nugget level. There is an implicit assumption that some facts are more important than others, which is implemented in a binary split between “vital” and “okay” nuggets. This distinction holds important implications for the TREC scoring model (essentially, systems only receive credit for retrieving vital nuggets) and is a source of evaluation instability. The upshot is that for many questions in the TREC test sets, the median score across all submitted runs is zero. In this work, we introduce a scoring model based on judgments from multiple assessors that captures a more refined notion of nugget importance. We demonstrate on TREC 2003, 2004, and 2005 data that our “nugget pyramids” address many shortcomings of the present methodology, while introducing only minimal additional overhead on the evaluation flow.
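
    A compact sketch of pyramid-style scoring follows: a nugget's weight is the fraction of assessors who marked it vital, recall sums the weights of the nuggets an answer retrieved, and precision uses the usual length allowance. The 100-character allowance and beta = 3 reflect common TREC practice but should be read as assumptions of this sketch rather than the paper's exact formulation.

```python
# Pyramid-weighted nugget F-score (illustrative sketch).
def pyramid_f(matched, assessor_votes, answer_length,
              allowance_per_nugget=100, beta=3.0):
    """assessor_votes: nugget id -> number of assessors who marked it vital.
    matched: set of nugget ids the answer was judged to contain.
    answer_length: non-whitespace character count of the answer."""
    max_votes = max(assessor_votes.values())
    weights = {n: v / max_votes for n, v in assessor_votes.items()}
    recall = sum(weights[n] for n in matched) / sum(weights.values())
    allowance = allowance_per_nugget * len(matched)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0.0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Hypothetical judgments from five assessors over four nuggets.
votes = {"n1": 5, "n2": 3, "n3": 1, "n4": 0}
print(round(pyramid_f({"n1", "n2"}, votes, answer_length=180), 3))  # ~0.899
```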

    Wikipedia-Based Semantic Enhancements for Information Nugget Retrieval

    When the objective of an information retrieval task is to return a nugget rather than a document, query terms that appear in a document often do not appear in the document's most relevant nugget for the query. In this thesis a new method of query expansion is proposed based on the Wikipedia link structure surrounding the most relevant articles, selected either automatically or by human assessors for the query. Evaluated with the Nuggeteer automatic scoring software, which we show to have a high correlation with human assessor scores for the ciQA 2006 topics, the expansion yields an increase in F-scores on the TREC Complex Interactive Question Answering task when integrated into an already high-performing baseline system. In addition, the method for finding synonyms using Wikipedia is evaluated on more common synonym detection tasks.
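
    The flavour of link-based expansion can be sketched as follows: starting from the Wikipedia articles judged most relevant to the query, candidate expansion terms are taken from article titles in the surrounding link structure, here restricted to bidirectional links. The selection rule and the toy link graph are assumptions for illustration, not the thesis's actual expansion method.

```python
# Expansion terms from the link neighbourhood of relevant articles (sketch).
def expansion_terms(relevant_articles, outlinks):
    """outlinks: article title -> set of titles it links to."""
    expansions = set()
    for article in relevant_articles:
        for target in outlinks.get(article, set()):
            # keep titles that also link back to the relevant article
            if article in outlinks.get(target, set()):
                expansions.add(target)
    return expansions - set(relevant_articles)

# Toy link graph (hypothetical).
links = {
    "Pluto": {"Dwarf planet", "Kuiper belt", "New Horizons"},
    "Dwarf planet": {"Pluto", "Eris"},
    "Kuiper belt": {"Pluto", "Neptune"},
    "New Horizons": {"NASA"},
}
print(expansion_terms(["Pluto"], links))  # {'Dwarf planet', 'Kuiper belt'} in some order
```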

    Design and Evaluation of Temporal Summarization Systems

    Temporal Summarization (TS) is a new track introduced as part of the Text REtrieval Conference (TREC) in 2013. This track aims to develop systems which can return important updates related to an event over time. In TREC 2013, the TS track specifically used disaster-related events such as earthquakes, hurricanes, and bombings. This thesis mainly focuses on building an effective TS system using a combination of information retrieval techniques. The developed TS system returns updates related to disaster events in a timely manner. Through participation in TREC 2013 and with experiments conducted after TREC, we examine the effectiveness of techniques such as distributional similarity for term expansion, which can be employed in building TS systems. This thesis also describes the effectiveness of other techniques, such as stemming, adaptive sentence selection over time, and de-duplication, in our system by comparing it with other baseline systems. The second part of the thesis examines the current methodology used for evaluating TS systems. We propose a modified evaluation method which could reduce the manual effort of assessors and also correlates well with the official track’s evaluation. We also propose a supervised learning based evaluation method, which correlates well with the official track’s evaluation of systems and could save the assessor’s time by as much as 80%.
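
    As an illustration of one component mentioned above, de-duplication, the sketch below emits a candidate update only when it is sufficiently dissimilar from updates already pushed for the event. The term-count cosine measure and the 0.7 threshold are assumptions for illustration, not the thesis's exact settings.

```python
# Emit an update only if it is not a near-duplicate of earlier updates (sketch).
import math
import re
from collections import Counter

def cosine(a: str, b: str) -> float:
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def push_if_novel(candidate: str, emitted: list, threshold: float = 0.7) -> bool:
    if any(cosine(candidate, prev) >= threshold for prev in emitted):
        return False  # too similar to something already emitted
    emitted.append(candidate)
    return True

emitted = ["A magnitude 7.0 earthquake struck near the coast this morning."]
print(push_if_novel("A 7.0 magnitude earthquake hit near the coast today.", emitted))  # False (near-duplicate)
print(push_if_novel("Officials report power outages across the region.", emitted))     # True
```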

    Answer Re-ranking with bilingual LDA and social QA forum corpus

    One of the most important tasks for AI is to find valuable information on the Web. In this research, we develop a question answering system that retrieves answers based on a topic model, bilingual latent Dirichlet allocation (Bi-LDA), and knowledge from a social question answering (SQA) forum such as Yahoo! Answers. Treating question and answer pairs from an SQA forum as a bilingual corpus, a shared topic over question and answer documents is assigned to each term so that the answer re-ranking system can infer the correlation of terms between questions and answers. A query expansion approach based on the topic model obtains a 9% higher top-150 mean reciprocal rank (MRR@150) and a 16% better geometric mean rank compared to a simple matching system via Okapi/BM25. In addition, this thesis compares performance under several experimental settings to clarify the factors behind the results.
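
    A compact sketch of the re-ranking step follows: assuming a topic model trained over question/answer pairs that share one topic space, each candidate answer's topic distribution is compared with the question's, and the similarity is blended with the base retrieval score. The precomputed topic vectors, the dot-product similarity, and the 0.5 mixing weight are illustrative assumptions, not the Bi-LDA model itself.

```python
# Re-rank candidate answers by blending retrieval score with topic similarity (sketch).
def rerank(question_topics, candidates, alpha=0.5):
    """candidates: list of (answer_id, base_score, topic_distribution)."""
    def blended(item):
        _, base, topics = item
        topic_sim = sum(q * t for q, t in zip(question_topics, topics))
        return alpha * base + (1 - alpha) * topic_sim
    return sorted(candidates, key=blended, reverse=True)

# Hypothetical 3-topic distributions; answer "a2" is topically closer to the question.
question = [0.7, 0.2, 0.1]
candidates = [("a1", 0.60, [0.1, 0.8, 0.1]),
              ("a2", 0.55, [0.8, 0.1, 0.1])]
print([aid for aid, _, _ in rerank(question, candidates)])  # ['a2', 'a1']
```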