11 research outputs found

    Diversification Based Static Index Pruning - Application to Temporal Collections

    Full text link
    Nowadays, web archives preserve the history of large portions of the web. As medias are shifting from printed to digital editions, accessing these huge information sources is drawing increasingly more attention from national and international institutions, as well as from the research community. These collections are intrinsically big, leading to index files that do not fit into the memory and an increase query response time. Decreasing the index size is a direct way to decrease this query response time. Static index pruning methods reduce the size of indexes by removing a part of the postings. In the context of web archives, it is necessary to remove postings while preserving the temporal diversity of the archive. None of the existing pruning approaches take (temporal) diversification into account. In this paper, we propose a diversification-based static index pruning method. It differs from the existing pruning approaches by integrating diversification within the pruning context. We aim at pruning the index while preserving retrieval effectiveness and diversity by pruning while maximizing a given IR evaluation metric like DCG. We show how to apply this approach in the context of web archives. Finally, we show on two collections that search effectiveness in temporal collections after pruning can be improved using our approach rather than diversity oblivious approaches

    On the suitability of intent spaces for IR diversification

    Full text link
    This is an electronic version of the paper presented at the International Workshop on Diversity in Document Retrieval (DDR 2012), held in Seattle on 2012Recent developments in Information Retrieval diversity are based on the consideration of a space of information need aspects, a notion which takes different forms in the literature. The choice of a suitable aspect space for diversification is a critical issue when designing an IR diversification strategy, which has not been explicitly addressed to some depth in the literature. This paper aims to identify relevant properties of the aspect space which may help the system designer in making a suitable choice in selecting and configuring this space, and diagnosing malfunctions of the diversification algorithms. In particular, we identify the mutual information between aspects and documents as a meaningful magnitude, in terms of which anomalous cases can be characterized. We further seek to discern favorable cases through a combination of theoretic and empirical analysis.This work is supported by the Spanish Government (TIN2011-28538-C02-01), and the Government of Madrid (S2009TIC-1542)

    Diverse near neighbor problem

    Get PDF
    Motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in SS (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query qq is linear in nn. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees.Charles Stark Draper LaboratoryNational Science Foundation (U.S.) (Award NSF CCF-1012042)David & Lucile Packard Foundatio

    Explicit relevance models in intent-oriented information retrieval diversification

    Full text link
    This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, http://dx.doi.org/10.1145/2348283.2348297.The intent-oriented search diversification methods developed in the field so far tend to build on generative views of the retrieval system to be diversified. Core algorithm components in particular redundancy assessment are expressed in terms of the probability to observe documents, rather than the probability that the documents be relevant. This has been sometimes described as a view considering the selection of a single document in the underlying task model. In this paper we propose an alternative formulation of aspect-based diversification algorithms which explicitly includes a formal relevance model. We develop means for the effective computation of the new formulation, and we test the resulting algorithm empirically. We report experiments on search and recommendation tasks showing competitive or better performance than the original diversification algorithms. The relevance-based formulation has further interesting properties, such as unifying two well-known state of the art algorithms into a single version. The relevance-based approach opens alternative possibilities for further formal connections and developments as natural extensions of the framework. We illustrate this by modeling tolerance to redundancy as an explicit configurable parameter, which can be set to better suit the characteristics of the IR task, or the evaluation metrics, as we illustrate empirically.This work was supported by the national Spanish projects TIN2011-28538-C02-01 and S2009TIC-1542

    Search Bias Quantification: Investigating Political Bias in Social Media and Web Search

    No full text
    Users frequently use search systems on the Web as well as online social media to learn about ongoing events and public opinion on personalities. Prior studies have shown that the top-ranked results returned by these search engines can shape user opinion about the topic (e.g., event or person) being searched. In case of polarizing topics like politics, where multiple competing perspectives exist, the political bias in the top search results can play a significant role in shaping public opinion towards (or away from) certain perspectives. Given the considerable impact that search bias can have on the user, we propose a generalizable search bias quantification framework that not only measures the political bias in ranked list output by the search system but also decouples the bias introduced by the different sources—input data and ranking system. We apply our framework to study the political bias in searches related to 2016 US Presidential primaries in Twitter social media search and find that both input data and ranking system matter in determining the final search output bias seen by the users. And finally, we use the framework to compare the relative bias for two popular search systems—Twitter social media search and Google web search—for queries related to politicians and political events. We end by discussing some potential solutions to signal the bias in the search results to make the users more aware of them.publishe

    Proportional Rankings

    Full text link
    In this paper we extend the principle of proportional representation to rankings. We consider the setting where alternatives need to be ranked based on approval preferences. In this setting, proportional representation requires that cohesive groups of voters are represented proportionally in each initial segment of the ranking. Proportional rankings are desirable in situations where initial segments of different lengths may be relevant, e.g., hiring decisions (if it is unclear how many positions are to be filled), the presentation of competing proposals on a liquid democracy platform (if it is unclear how many proposals participants are taking into consideration), or recommender systems (if a ranking has to accommodate different user types). We study the proportional representation provided by several ranking methods and prove theoretical guarantees. Furthermore, we experimentally evaluate these methods and present preliminary evidence as to which methods are most suitable for producing proportional rankings

    Diverse sampling of streaming data

    Get PDF
    Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Cataloged from student-submitted PDF version of thesis.Includes bibliographical references (pages 49-51).This thesis addresses the problem of diverse sampling as a dispersion problem and proposes solutions that are optimized for large streaming data. Finding the optimal solution to the dispersion problem is NP-hard. Therefore, existing and proposed solutions are approximation algorithms. This work evaluates the performance of dierent algorithms in practice and compares them to the theoretical guarantees.by Aizana Turmukhametova.M. Eng

    Approximate nearest neighbor and its many variants

    Get PDF
    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (p. 53-55).This thesis investigates two variants of the approximate nearest neighbor problem. First, motivated by the recent research on diversity-aware search, we investigate the k-diverse near neighbor reporting problem. The problem is defined as follows: given a query point q, report the maximum diversity set S of k points in the ball of radius r around q. The diversity of a set S is measured by the minimum distance between any pair of points in S (the higher, the better). We present two approximation algorithms for the case where the points live in a d-dimensional Hamming space. Our algorithms guarantee query times that are sub-linear in n and only polynomial in the diversity parameter k, as well as the dimension d. For low values of k, our algorithms achieve sub-linear query times even if the number of points within distance r from a query q is linear in n. To the best of our knowledge, these are the first known algorithms of this type that offer provable guarantees. In the other variant, we consider the approximate line near neighbor (LNN) problem. Here, the database consists of a set of lines instead of points but the query is still a point. Let L be a set of n lines in the d dimensional euclidean space Rd. The goal is to preprocess the set of lines so that we can answer the Line Near Neighbor (LNN) queries in sub-linear time. That is, given the query point ... we want to report a line ... (if there is any), such that ... for some threshold value r, where ... is the euclidean distance between them. We start by illustrating the solution to the problem in the case where there are only two lines in the database and present a data structure in this case. Then we show a recursive algorithm that merges these data structures and solve the problem for the general case of n lines. The algorithm has polynomial space and performs only a logarithmic number of calls to the approximate nearest neighbor subproblem.by Sepideh Mahabadi.S.M
    corecore