1,431 research outputs found

    Design Patterns for Fusion-Based Object Retrieval

    We address the task of ranking objects (such as people, blogs, or verticals) that, unlike documents, do not have direct term-based representations. To match them against keyword queries, evidence needs to be amassed from documents that are associated with the given object. We present two design patterns, i.e., general reusable retrieval strategies, which encompass most existing approaches. One strategy combines evidence on the term level (early fusion), while the other does it on the document level (late fusion). We demonstrate the generality of these patterns by applying them to three different object retrieval tasks: expert finding, blog distillation, and vertical ranking. Comment: Proceedings of the 39th European conference on Advances in Information Retrieval (ECIR '17), 201
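
    The distinction between the two patterns can be illustrated with a minimal sketch. The toy data, the names `docs` and `assoc`, and the length-normalised term-count score are assumptions made for illustration only; they are not the retrieval models used in the paper.

```python
from collections import Counter

# Hypothetical toy data: documents and the objects (here, people) they are associated with.
docs = {
    "d1": "neural ranking models for search",
    "d2": "blog search and feed ranking",
    "d3": "cooking recipes for pasta",
}
assoc = {"alice": ["d1", "d2"], "bob": ["d3"]}


def tf_score(query, term_counts):
    """Toy relevance score (an assumption): share of tokens that are query terms."""
    total = sum(term_counts.values())
    if total == 0:
        return 0.0
    return sum(term_counts[t] for t in query.split()) / total


def early_fusion(query):
    """Term-level fusion: pool the terms of associated documents, then score once per object."""
    scores = {}
    for obj, doc_ids in assoc.items():
        pooled = Counter()
        for d in doc_ids:
            pooled.update(docs[d].split())
        scores[obj] = tf_score(query, pooled)
    return scores


def late_fusion(query):
    """Document-level fusion: score each document, then sum the scores per object."""
    doc_scores = {d: tf_score(query, Counter(text.split())) for d, text in docs.items()}
    return {obj: sum(doc_scores[d] for d in doc_ids) for obj, doc_ids in assoc.items()}


early = early_fusion("search ranking")
late = late_fusion("search ranking")
```

    In this sketch both strategies rank `alice` above `bob`, but they arrive at different scores because evidence is pooled at different stages.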

    Online Forum Thread Retrieval using Pseudo Cluster Selection and Voting Techniques

    Online forums facilitate knowledge seeking and sharing on the Web. However, the shared knowledge is not fully utilized due to information overload. Thread retrieval is one method to overcome information overload. In this paper, we propose a model that combines two existing approaches: Pseudo Cluster Selection and Voting Techniques. In both, a retrieval system first scores a list of messages and then ranks threads by aggregating their scored messages. They differ in what they aggregate and how. Pseudo cluster selection focuses on the input, while voting techniques focus on the aggregation method. Our combined models focus on both the input and the aggregation method. The results show that some combined models are statistically superior to baseline methods. Comment: The original publication is available at http://www.springerlink.com/. arXiv admin note: substantial text overlap with arXiv:1212.533
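
    The two-stage idea described above (choose which scored messages to use, then choose how to aggregate them) can be sketched as follows. The function name `rank_threads`, the top-`k` selection rule, and the example aggregation functions are illustrative assumptions, not the paper's exact models.

```python
from collections import defaultdict


def rank_threads(scored_messages, k=2, aggregate=sum):
    """Rank threads from (thread_id, message_score) pairs.

    k         -- input selection: keep only the k best-scoring messages per thread
    aggregate -- aggregation method applied to the selected scores (e.g. sum or max)
    """
    by_thread = defaultdict(list)
    for thread_id, score in scored_messages:
        by_thread[thread_id].append(score)

    thread_scores = {
        thread_id: aggregate(sorted(scores, reverse=True)[:k])
        for thread_id, scores in by_thread.items()
    }
    return sorted(thread_scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical message scores from a first-stage retrieval run.
messages = [("t1", 0.9), ("t1", 0.2), ("t2", 0.5), ("t2", 0.5), ("t2", 0.4)]

top2_sum = rank_threads(messages, k=2, aggregate=sum)
top3_sum = rank_threads(messages, k=3, aggregate=sum)
top2_max = rank_threads(messages, k=2, aggregate=max)
```

    Changing only `k` or only `aggregate` can change which thread ranks first, which is why the two choices are worth varying together.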

    External query reformulation for text-based image retrieval

    In text-based image retrieval, the Incomplete Annotation Problem (IAP) can greatly degrade retrieval effectiveness. A standard method used to address this problem is pseudo relevance feedback (PRF), which updates user queries by adding feedback terms selected automatically from top-ranked documents in a prior retrieval run. PRF assumes that the target collection provides enough feedback information to select effective expansion terms. This is often not the case in image retrieval, since images often have only short metadata annotations, leading to the IAP. Our work proposes the use of an external knowledge resource (Wikipedia) in the process of refining user queries. In our method, Wikipedia documents strongly related to the terms in the user query ("definition documents") are first identified by title matching between the query and titles of Wikipedia articles. These definition documents are used as indicators to re-weight the feedback documents from an initial search run on a Wikipedia abstract collection using the Jaccard coefficient. The new weights of the feedback documents are combined with the scores rated by different indicators. Query-expansion terms are then selected based on these new weights for the feedback documents. Our method is evaluated on the ImageCLEF WikipediaMM image retrieval task using text-based retrieval on the document metadata fields. The results show significant improvement compared to standard PRF methods.
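
    As a rough illustration of the re-weighting step, the sketch below computes the Jaccard coefficient between term sets and uses it to scale feedback-document scores. The abstract does not specify how the coefficient is combined with the initial scores, so the multiplication by the best match in `reweight` is purely an assumption, as are all names and data.

```python
def jaccard(terms_a, terms_b):
    """Jaccard coefficient |A & B| / |A | B| between two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def reweight(feedback_docs, definition_docs):
    """Scale each feedback document's score by its best Jaccard match.

    feedback_docs   -- {doc_id: (initial_score, list_of_terms)}
    definition_docs -- list of term lists from the title-matched documents
    The multiplicative combination is an assumption made for this sketch.
    """
    new_weights = {}
    for doc_id, (score, terms) in feedback_docs.items():
        best = max((jaccard(terms, d) for d in definition_docs), default=0.0)
        new_weights[doc_id] = score * best
    return new_weights


# Hypothetical example data.
definition_docs = [["eiffel", "tower", "paris", "landmark"]]
feedback_docs = {
    "f1": (2.0, ["eiffel", "tower", "paris", "france"]),
    "f2": (3.0, ["football", "match", "paris"]),
}
weights = reweight(feedback_docs, definition_docs)
```

    Here `f1` overtakes `f2` after re-weighting because it shares more terms with the definition document, even though its initial score was lower.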

    Methods for ranking user-generated text streams: a case study in blog feed retrieval

    User-generated content is one of the main sources of information on the Web nowadays. With the huge amount of this type of data being generated every day, having an efficient and effective retrieval system is essential. The goal of such a retrieval system is to enable users to search through this data and retrieve documents relevant to their information needs. Among the different retrieval tasks for user-generated content, retrieving and ranking streams is an important one with various applications. The goal of this task is to rank streams, as collections of documents in chronological order, in response to a user query. This differs from traditional retrieval tasks, where the goal is to rank single documents and temporal properties are less important in the ranking. In this thesis we investigate the problem of ranking user-generated streams with a case study in blog feed retrieval. Blogs, like all other user-generated streams, have specific properties and require new considerations in the retrieval methods. Blog feed retrieval can be defined as retrieving blogs with a recurrent interest in the topic of the given query. We define three different properties of blog feed retrieval, each of which introduces new challenges in the ranking task. These properties include: 1) term mismatch in blog retrieval, 2) evolution of topics in blogs and 3) diversity of blog posts. For each of these properties, we investigate its corresponding challenges and propose solutions to overcome those challenges. We further analyze the effect of our solutions on the performance of a retrieval system. We show that taking the new properties into account when developing the retrieval system can help us improve state-of-the-art retrieval methods. In all the proposed methods, we pay specific attention to temporal properties, which we believe carry important information in any type of stream. We show that when combined with content-based information, temporal information can be useful in different situations. Although we apply our methods to blog feed retrieval, they are mostly general methods that are applicable to similar stream ranking problems like ranking experts or ranking Twitter users.

    Looking at things differently: Exploring perspective recall for informal text retrieval

    When retrieving informal text such as blogs, comments, and contributions to discussion forums, users often want to uncover different perspectives on a given issue. To help uncover perspectives, we examine the use of query expansion against multiple external corpora. We consider two informal text retrieval tasks: blog post finding and blog finding. We operationalize the idea of uncovering multiple perspectives by query expansion against multiple corpora from different genres. We use two approaches to incorporate these perspectives: a rank-based combination of runs and a mixture of query models. The use of external sources does indeed generate different views on a topic, as becomes clear from the unique relevant results identified by the expanded runs compared to the baseline run. Even after combining the expanded run with the original run, unique relevant documents are found by both of the perspectives. As to the combination methods, the mixture of query models outperforms the rank combination and leads to significant improvements in MAP score over the baseline.
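
    A mixture of query models can be sketched as a weighted interpolation of term distributions, one per corpus. The linear form, the function name `mix_query_models`, and the example distributions below are assumptions for illustration; the abstract does not give the exact formulation or weights.

```python
def mix_query_models(models, weights):
    """Linearly interpolate term distributions {term: probability}.

    Weights are normalised to sum to 1, so the result is again a distribution
    whenever every input model is one.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per query model")
    total = sum(weights)
    mixed = {}
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            mixed[term] = mixed.get(term, 0.0) + (weight / total) * prob
    return mixed


# Hypothetical query models: the original query and an expansion from a news corpus.
original = {"obama": 0.5, "speech": 0.5}
news_expansion = {"obama": 0.4, "election": 0.6}

mixed = mix_query_models([original, news_expansion], weights=[1.0, 1.0])
```

    Terms supported by several corpora accumulate probability mass, while corpus-specific terms such as `election` enter the query with a smaller weight.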

    Using temporal evidence in blog search

    In this paper we present a study on the relevance of web documents over time and the use of temporal evidence in blog search tasks. Time is an intrinsic property of social media, most notably in blogs, where each post typically carries a timestamp representing its publication date. However, due to the challenges in obtaining document collections containing temporal information, research in this field has been scarce. We base our study on the Blog06 collection and the relevance assessments produced in the context of the TREC Blog Track to investigate the relevance of time-based features in standard retrieval tasks. We observe small but statistically significant improvements over a BM25 baseline when temporal information is used. We also find a direct connection between recency and relevance of documents for ad-hoc retrieval.

    The voting model for people search

    The thesis investigates how persons in an enterprise organisation can be ranked in response to a query, so that those persons with relevant expertise to the query topic are ranked first. The expertise areas of the persons are represented by documentary evidence of expertise, known as candidate profiles. The statement of this research work is that the expert search task in an enterprise setting can be successfully and effectively modelled using a voting paradigm. In the so-called Voting Model, when a document is retrieved for a query, this document represents a vote for every expert associated with the document to have relevant expertise to the query topic. This voting paradigm is manifested by the proposition of various voting techniques that aggregate the votes from documents to candidate experts. Moreover, the research work demonstrates that these voting techniques can be modelled in terms of a Bayesian belief network, providing probabilistic semantics for the proposed voting paradigm. The proposed voting techniques are thoroughly evaluated on three standard expert search test collections, deriving conclusions concerning each component of the Voting Model, namely the method used to identify the documents that represent each candidate's expertise areas, the weighting models that are used to rank the documents, and the voting techniques which are used to convert the ranking of documents into the ranking of experts. Effective settings are identified and insights about the behaviour of each voting technique are derived. Moreover, the practical aspects of deploying an expert search engine such as its efficiency and how it should be trained are also discussed. This thesis includes an investigation of the relationship between the quality of the underlying ranking of documents and the resulting effectiveness of the voting techniques. 
The thesis shows that various effective document retrieval approaches have a positive impact on the performance of the voting techniques. Interestingly, it also shows that a 'perfect' ranking of documents does not necessarily translate into an equally perfect ranking of candidates. Insights are provided into the reasons for this, which relate to the complexity of evaluating tasks based on ranking aggregates of documents. Furthermore, it is shown how query expansion can be adapted and integrated into the expert search process, such that the query expansion successfully acts on a pseudo-relevant set containing only a list of names of persons. Five ways of performing query expansion in the expert search task are proposed, which vary in the extent to which they tackle expert search-specific problems, in particular, the occurrence of topic drift within the expertise evidence for each candidate. Not all documentary evidence of expertise for a given person is equally useful, nor may there be sufficient expertise evidence for a relevant person within an enterprise. This thesis investigates various approaches to identify the high-quality evidence for each person, and shows how the World Wide Web can be mined as a resource to find additional expertise evidence. This thesis also demonstrates how the proposed model can be applied to other people search tasks such as ranking blog(ger)s in the blogosphere setting, and suggesting reviewers for the papers submitted to an academic conference. The central contributions of this thesis are the introduction of the Voting Model, and the definition of a number of voting techniques within the model. The thesis draws insights from an extremely large and exhaustive set of experiments, involving many experimental parameters, and using different test collections for several people search tasks. 
This illustrates the effectiveness and the generality of the Voting Model at tackling various people search tasks and, indeed, the retrieval of aggregates of documents in general.
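
    The core idea of the voting paradigm, where each retrieved document counts as a vote for every candidate associated with it, can be sketched as below. The three aggregation options shown (vote counting, score summation, reciprocal rank) are common illustrative choices and are not claimed to be the thesis's exact set of voting techniques; all names and data are hypothetical.

```python
from collections import defaultdict


def vote(ranked_docs, doc_to_candidates, technique="votes"):
    """Turn a document ranking into a candidate ranking.

    ranked_docs       -- list of (doc_id, score), best document first
    doc_to_candidates -- {doc_id: [candidate, ...]} profile associations
    technique         -- "votes", "score_sum" or "reciprocal_rank" (illustrative names)
    """
    scores = defaultdict(float)
    for rank, (doc_id, score) in enumerate(ranked_docs, start=1):
        for candidate in doc_to_candidates.get(doc_id, []):
            if technique == "votes":
                scores[candidate] += 1.0          # every retrieved document is one vote
            elif technique == "score_sum":
                scores[candidate] += score        # votes weighted by retrieval score
            elif technique == "reciprocal_rank":
                scores[candidate] += 1.0 / rank   # votes weighted by rank position
            else:
                raise ValueError(f"unknown technique: {technique}")
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical document ranking and candidate profiles.
ranked = [("d1", 3.0), ("d2", 1.0), ("d3", 0.5)]
profiles = {"d1": ["ann"], "d2": ["bo"], "d3": ["bo"]}

by_votes = vote(ranked, profiles, "votes")
by_score = vote(ranked, profiles, "score_sum")
by_rank = vote(ranked, profiles, "reciprocal_rank")
```

    In this toy example, counting votes favours `bo` (two documents), while the score-weighted and rank-weighted variants favour `ann` (one highly ranked document), which shows how the choice of voting technique can change the candidate ranking.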

    Blog feed search with a post index
