
    Incremental Test Collections

    Corpora and topics are readily available for information retrieval research. Relevance judgments, which are necessary for system evaluation, are expensive; the cost of obtaining them prohibits in-house evaluation of retrieval systems on new corpora or new topics. We present an algorithm for cheaply constructing sets of relevance judgments. Our method intelligently selects documents to be judged and decides when to stop in such a way that, with very little work, there can be a high degree of confidence in the result of the evaluation. We demonstrate the algorithm's effectiveness by showing that it produces small sets of relevance judgments that reliably discriminate between two systems. The algorithm can be used to incrementally design retrieval systems by simultaneously comparing sets of systems. The number of additional judgments needed after each incremental design change decreases at a rate reciprocal to the number of systems being compared. To demonstrate the effectiveness of our method, we evaluate TREC ad hoc submissions, showing that with 95% fewer relevance judgments we can reach a Kendall's tau rank correlation of at least 0.9.
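
    As a rough illustration of the evaluation criterion reported above, the sketch below compares the system ordering produced from a full judgment set with the ordering produced from a reduced set using Kendall's tau. The system names and effectiveness scores are hypothetical and scipy is assumed to be available; this shows only the reported metric, not the paper's selection algorithm.

        # A minimal sketch with made-up effectiveness scores.
        from scipy.stats import kendalltau

        systems = ["sysA", "sysB", "sysC", "sysD", "sysE"]
        map_full = [0.31, 0.28, 0.35, 0.22, 0.30]      # MAP using all relevance judgments (hypothetical)
        map_reduced = [0.30, 0.27, 0.36, 0.20, 0.29]   # MAP using ~5% of the judgments (hypothetical)

        # Kendall's tau measures how similarly the two score lists order the systems;
        # a value of at least 0.9 is commonly read as "effectively the same ranking".
        tau, p_value = kendalltau(map_full, map_reduced)
        print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")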

    Reducing Reliance on Relevance Judgments for System Comparison by Using Expectation-Maximization


    Search engines that learn from their users

    More than half the world's population uses web search engines, resulting in over half a billion queries every single day. For many people, web search engines such as Baidu, Bing, Google, and Yandex are among the first resources they go to when a question arises. Moreover, for many, search engines have become the most trusted route to information, even more so than traditional media such as newspapers, news websites, or news channels on television. What web search engines present people with greatly influences what they believe to be true, and consequently it influences their thoughts, opinions, decisions, and the actions they take. With this in mind, two things are important from an information retrieval research perspective: first, to understand how well search engines (rankers) perform, and second, to use this knowledge to improve them. This thesis is about these two topics: evaluation of search engines and learning search engines. In the first part of this thesis we investigate how user interactions with search engines can be used to evaluate search engines. In particular, we introduce a new online evaluation paradigm called multileaving that extends interleaving. With multileaving, many rankers can be compared at once by combining document lists from these rankers into a single result list and attributing user interactions with this list to the rankers. We then investigate the relation between A/B testing and interleaved comparison methods. Both studies lead to much higher sensitivity of the evaluation methods, meaning that fewer user interactions are required to arrive at reliable conclusions. This has the important implication that fewer users need to be exposed to the results of possibly inferior search engines. In the second part of this thesis we turn to online learning to rank, building on the evaluation methods introduced and extended in the first part. We learn the parameters of base rankers from user interactions, and we use the multileaving methods as feedback in our learning method, leading to much faster convergence than existing methods. Again, the important implication is that fewer users need to be exposed to possibly inferior search engines, as the rankers adapt more quickly to changes in user preferences. The last part of this thesis is of a different nature than the earlier two parts: we no longer study algorithms. Progress in information retrieval research has always been driven by a combination of algorithms, shared resources, and evaluation. In the last part we focus on the latter two and introduce a new shared resource and a new evaluation paradigm. Firstly, we propose Lerot, an online evaluation framework that allows us to simulate users interacting with a search engine. Our implementation has been released as open source software and is currently being used by researchers around the world. Secondly, we introduce OpenSearch, a new evaluation paradigm involving real users of real search engines. We describe an implementation of this paradigm that has already been widely adopted by the research community through challenges at CLEF and TREC.
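
    The multileaving idea described above can be sketched roughly as follows: combine the rankers' document lists into one result list, remember which ranker contributed each document, and credit clicks back to that ranker. This is a simplified team-draft-style variant with made-up document ids, not the thesis's exact method.

        import random

        def team_draft_multileave(rankings, length):
            """Combine several rankers' ranked lists into one multileaved list.
            Returns the combined list and a map from document to the ranker
            ("team") that contributed it, used later to credit clicks."""
            combined, teams = [], {}
            while len(combined) < length:
                added = False
                # Each round, rankers pick in a random order (team-draft style).
                for r in random.sample(range(len(rankings)), len(rankings)):
                    for doc in rankings[r]:
                        if doc not in teams:
                            combined.append(doc)
                            teams[doc] = r
                            added = True
                            break
                    if len(combined) >= length:
                        break
                if not added:
                    break  # every ranker's list is exhausted
            return combined, teams

        def credit_clicks(clicked_docs, teams, n_rankers):
            """Attribute each clicked document to the ranker that contributed it."""
            credits = [0] * n_rankers
            for doc in clicked_docs:
                if doc in teams:
                    credits[teams[doc]] += 1
            return credits

        # Hypothetical usage: three rankers, two clicks observed on the combined list.
        rankers = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
        combined, teams = team_draft_multileave(rankers, length=4)
        print(combined, credit_clicks(["d2", "d5"], teams, len(rankers)))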

    Evaluating Information Retrieval Systems With Multiple Non-Expert Assessors

    Many current test collections require the use of expert judgments during construction: the true label of each document is given by an expert assessor. However, the cost and effort associated with expert training and judging are typically quite high when a large number of documents must be judged. One way to address this issue is to have each document judged by multiple non-expert assessors at a lower expense. However, two key factors can make this method difficult: the variability in assessors' judging abilities, and the aggregation of the noisy labels into a single consensus label. Much previous work has shown how to use this method to replace expert labels in relevance evaluation, but the effects of relevance judgment errors on ranking system evaluation have been less explored. This thesis mainly investigates how to best evaluate information retrieval systems with noisy labels, where no ground-truth labels are provided and each document may receive multiple noisy labels. Based on our simulation results on two datasets, we find that conservative assessors, who tend to label incoming documents as non-relevant, are preferable. Two important factors affect the overall conservativeness of the consensus labels: the assessors' conservativeness and the relevance standard. This observation essentially provides a guideline on what kind of consensus algorithms or assessors are needed in order to preserve a high correlation with expert labels in ranking system evaluation. We also systematically investigate how to find consensus labels for documents that are equally likely to be relevant or non-relevant. We investigate a content-based consensus algorithm that links the noisy labels with document content. We compare it against state-of-the-art consensus algorithms and find that, depending on the document collection, this content-based approach may help or hurt performance.
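
    A minimal sketch of the label-aggregation step discussed above: majority vote over noisy binary judgments, with ties broken toward non-relevant to mirror the finding that conservative consensus labels are preferable. The documents and judgments are hypothetical, and this is a simple stand-in for, not a reproduction of, the consensus algorithms studied in the thesis.

        from collections import Counter

        def conservative_consensus(labels):
            """Majority vote over binary relevance labels (1 = relevant, 0 = non-relevant),
            breaking ties toward non-relevant."""
            counts = Counter(labels)
            # Equal confidence either way resolves to non-relevant (the conservative choice).
            return 1 if counts[1] > counts[0] else 0

        # Hypothetical judgments from multiple non-expert assessors per document.
        judged = {
            "doc1": [1, 1, 0],
            "doc2": [0, 1, 0],
            "doc3": [1, 0],    # a tie, resolved conservatively
        }
        consensus = {doc: conservative_consensus(labels) for doc, labels in judged.items()}
        print(consensus)  # {'doc1': 1, 'doc2': 0, 'doc3': 0}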

    Towards Population of Knowledge Bases from Conversational Sources

    With an increasing amount of data created daily, it is challenging for users to organize and discover information from massive collections of digital content (e.g., text and speech). The population of knowledge bases requires linking information from unstructured sources (e.g., news articles and web pages) to structured external knowledge bases (e.g., Wikipedia), which has the potential to advance information archiving and access, and to support knowledge discovery and reasoning. Because of the complexity of this task, knowledge base population is composed of multiple sub-tasks, including the entity linking task, defined as linking mentions of entities (e.g., persons, organizations, and locations) found in documents to their referents in external knowledge bases, and the event task, defined as extracting related information for events that should be entered in the knowledge base. Most prior work on tasks related to knowledge base population has focused on dissemination-oriented sources written in the third person (e.g., news articles) that benefit from two characteristics: the content is written in formal language and is to some degree self-contextualized, and the entities mentioned (e.g., persons) are likely to be widely known to the public, so that rich information can be found in existing general knowledge bases (e.g., Wikipedia and DBpedia). The work proposed in this thesis focuses on tasks related to knowledge base population for conversational sources written in the first person (e.g., emails and phone recordings), which offer new challenges. One challenge is that most conversations (e.g., 68% of the person names and 53% of the organization names in Enron emails) refer to entities that are known to the conversational participants but not widely known. Thus, existing entity linking techniques relying on general knowledge bases are not appropriate. Another challenge is that some of the shared context between participants in first-person conversations may be implicit and thus challenging to model, increasing the difficulty, even for human annotators, of identifying the true referents. This thesis focuses on several tasks relating to the population of knowledge bases for conversational content: the population of collection-specific knowledge bases for organization entities and meetings from email collections; the entity linking task that resolves mentions of three types of entities (person, organization, and location) found in both conversational text (emails) and speech (phone recordings) sources to multiple knowledge bases, including a general knowledge base built from Wikipedia and collection-specific knowledge bases; the meeting linking task that links meeting-related email messages to the referenced meeting entries in the collection-specific meeting knowledge base; and speaker identification techniques to improve the entity linking task for phone recordings without known speakers. Following the model-based evaluation paradigm, three collections (namely, Enron emails, Avocado emails, and Enron phone recordings) are used as representations of conversational sources, new test collections are created for each task, and experiments are conducted for each task to evaluate the efficacy of the proposed methods and to provide a comparison to existing state-of-the-art systems. This work has implications for the research fields of e-discovery, scientific collaboration, speaker identification, speech retrieval, and privacy protection.
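
    To make the entity linking task defined above concrete, here is a toy sketch of candidate matching against a collection-specific knowledge base: mentions are matched to KB entries by normalized name overlap, with unmatched mentions treated as NIL. The KB entries, mentions, and scoring threshold are all hypothetical and far simpler than the systems described in the thesis.

        def link_mention(mention, knowledge_base):
            """Return the best-matching KB entry id for a mention, or None (NIL)
            if nothing matches. Matching is plain token overlap on known names."""
            mention_tokens = set(mention.lower().split())
            best_id, best_score = None, 0.0
            for entry_id, names in knowledge_base.items():
                for name in names:
                    name_tokens = set(name.lower().split())
                    overlap = len(mention_tokens & name_tokens) / max(len(name_tokens), 1)
                    if overlap > best_score:
                        best_id, best_score = entry_id, overlap
            return best_id if best_score > 0.5 else None

        # Hypothetical collection-specific KB built from an email collection.
        kb = {
            "org:001": ["Enron Broadband Services", "EBS"],
            "per:042": ["Jane Q. Smith", "Jane Smith"],
        }
        print(link_mention("Jane Smith", kb))      # -> "per:042"
        print(link_mention("Acme Widgets", kb))    # -> None (NIL: not in the KB)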

    Liberal or Conservative: Evaluation and Classification with Distribution as Ground Truth.

    The ability to classify the political leaning of a large number of articles and items is valuable to both academic research and practical applications. The challenge, though, is not only about developing innovative classification algorithms, which constitutes a “classifier” theme in this thesis, but also about how to define the “ground truth” of items’ political leaning, how to elicit labels when labelers do not agree, and how to evaluate classifiers with unreliable labeled data, which constitutes a “ground truth” theme in the thesis. The “ground truth” theme argues for the use of distributions (e.g., 0.6 conservative, 0.4 liberal) instead of labels (e.g., conservative, liberal) as the underlying ground truth of items’ political leaning, where disagreements among labelers are not human errors but rather useful information reflecting the distribution of people’s subjective opinions. Empirical data demonstrate that distributions are dispersed: there are many items upon which labelers simply do not agree. Therefore, mapping distributions into single labels requires more than just majority vote. Also, one can no longer assume the labels from a few labelers are reliable, because a different small sample of labelers might yield a very different picture. However, even though individual labeled items are not reliable, simulation suggests that we may still reliably evaluate and rank classifiers, as long as we have a large number of labeled items for evaluation. The optimal approach is to obtain one label per item with many items (e.g., 1000 to 3000) for evaluation. The “classifier” theme proposes the LabelPropagator algorithm, which propagates the political leaning of known articles and users to the target nodes in order to classify them. LabelPropagator achieves higher accuracy than alternative classifiers based on text analysis, suggesting that a relatively small number of labeled people and stories, together with a large number of people-to-item votes, can be used to classify the other people and items. An article’s source is useful as an input for propagation, while text similarities, users’ friendship, and “href” links to articles are not.
    PhD thesis, Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/97979/1/mrzhou_1.pd
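
    A rough sketch of the propagation idea described above: known leaning scores (here -1 for liberal, +1 for conservative) are clamped, and every unlabeled user or article repeatedly takes the average of its neighbors' scores in the vote graph. The graph, seed labels, and iteration count are hypothetical; this illustrates label propagation in general, not the thesis's LabelPropagator.

        def propagate_leanings(edges, seeds, n_iters=20):
            """Iteratively propagate leaning scores over an undirected vote graph.
            edges: (node, node) pairs, e.g. a user voting on an article.
            seeds: node -> known leaning in [-1.0, +1.0], clamped every iteration."""
            neighbors = {}
            for a, b in edges:
                neighbors.setdefault(a, []).append(b)
                neighbors.setdefault(b, []).append(a)

            scores = {node: seeds.get(node, 0.0) for node in neighbors}
            for _ in range(n_iters):
                updated = {}
                for node, nbrs in neighbors.items():
                    if node in seeds:
                        updated[node] = seeds[node]   # clamp known labels
                    else:
                        updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
                scores = updated
            return scores

        # Hypothetical bipartite user-article vote graph with two labeled users.
        edges = [("u1", "a1"), ("u1", "a2"), ("u2", "a2"), ("u2", "a3"), ("u3", "a3")]
        seeds = {"u1": -1.0, "u2": +1.0}   # u1 known liberal, u2 known conservative
        print(propagate_leanings(edges, seeds))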