923 research outputs found

    Stochastic Query Covering for Fast Approximate Document Retrieval

    Get PDF
    We design algorithms that, given a collection of documents and a distribution over user queries, return a small subset of the document collection in such a way that we can efficiently provide high-quality answers to user queries using only the selected subset. This approach has applications when space is a constraint or when the query-processing time increases significantly with the size of the collection. We study our algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction of the entire collection, they can provide answers to most user queries, achieving a performance close to the optimal. To complement our theoretical findings, we experimentally show the versatility of our approach by considering two important cases in the context of Web search. In the first case, we favor the retrieval of documents that are relevant to the query, whereas in the second case we aim for document diversification. Both the theoretical and the experimental analysis provide strong evidence of the potential value of query covering in diverse application scenarios

    Modeling Temporal Evidence from External Collections

    Full text link
    Newsworthy events are broadcast through multiple mediums and prompt the crowds to produce comments on social media. In this paper, we propose to leverage on this behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. In this approach, we build on two major novelties. First, we mine temporal evidences from hundreds of external sources into topic-based external collections to improve the robustness of the detection of relevant time periods. Second, we propose a formal retrieval model that generalizes the use of the temporal dimension across different aspects of the retrieval process. In particular, we show that temporal evidence of external collections can be used to (i) infer a topic's temporal relevance, (ii) select the query expansion terms, and (iii) re-rank the final results for improved precision. Experiments with TREC Microblog collections show that the proposed time-aware retrieval model makes an effective and extensive use of the temporal dimension to improve search results over the most recent temporal models. Interestingly, we observe a strong correlation between precision and the temporal distribution of retrieved and relevant documents.Comment: To appear in WSDM 201

    Learning to Diversify Web Search Results with a Document Repulsion Model

    Get PDF
    Search diversification (also called diversity search), is an important approach to tackling the query ambiguity problem in information retrieval. It aims to diversify the search results that are originally ranked according to their probabilities of relevance to a given query, by re-ranking them to cover as many as possible different aspects (or subtopics) of the query. Most existing diversity search models heuristically balance the relevance ranking and the diversity ranking, yet lacking an efficient learning mechanism to reach an optimized parameter setting. To address this problem, we propose a learning-to-diversify approach which can directly optimize the search diversification performance (in term of any effectiveness metric). We first extend the ranking function of a widely used learning-to-rank framework, i.e., LambdaMART, so that the extended ranking function can correlate relevance and diversity indicators. Furthermore, we develop an effective learning algorithm, namely Document Repulsion Model (DRM), to train the ranking function based on a Document Repulsion Theory (DRT). DRT assumes that two result documents covering similar query aspects (i.e., subtopics) should be mutually repulsive, for the purpose of search diversification. Accordingly, the proposed DRM exerts a repulsion force between each pair of similar documents in the learning process, and includes the diversity effectiveness metric to be optimized as part of the loss function. Although there have been existing learning based diversity search methods, they often involve an iterative sequential selection process in the ranking process, which is computationally complex and time consuming for training, while our proposed learning strategy can largely reduce the time cost. Extensive experiments are conducted on the TREC diversity track data (2009, 2010 and 2011). The results demonstrate that our model significantly outperforms a number of baselines in terms of effectiveness and robustness. Further, an efficiency analysis shows that the proposed DRM has a lower computational complexity than the state of the art learning-to-diversify methods

    Combining implicit and explicit topic representations for result diversification

    Get PDF
    Result diversification deals with ambiguous or multi-faceted queries by providing documents that cover as many subtopics of a query as possible. Various approaches to subtopic modeling have been proposed. Subtopics have been extracted internally, e.g., from retrieved documents, and externally, e.g., from Web resources such as query logs. Internally modeled subtopics are often implicitly represented, e.g., as latent topics, while externally modeled subtopics are often explicitly represented, e.g., as reformulated queries. We propose a framework that: i) combines both implicitly and explicitly represented subtopics; and ii) allows flexible combination of multiple external resources in a transparent and unified manner. Specifically, we use a random walk based approach to estimate the similarities of the explicit subtopics mined from a number of heterogeneous resources: click logs, anchor text, and web n-grams. We then use these similarities to regularize the latent topics extracted from the top-ranked documents, i.e., the internal (implicit) subtopics. Empirical results show that regularization with explicit subtopics extracted from the right resource leads to improved diversification results, indicating that the proposed regularization with (explicit) external resources forms better (implicit) topic models. Click logs and anchor text are shown to be more effective resources than web n-grams under current experimental settings. Combining resources does not always lead to better results, but achieves a robust performance. This robustness is important for two reasons: it cannot be predicted which resources will be most effective for a given query, and it is not yet known how to reliably determine the optimal model parameters for building implicit topic models

    Temporal models for mining, ranking and recommendation in the Web

    Get PDF
    Due to their first-hand, diverse and evolution-aware reflection of nearly all areas of life, heterogeneous temporal datasets i.e., the Web, collaborative knowledge bases and social networks have been emerged as gold-mines for content analytics of many sorts. In those collections, time plays an essential role in many crucial information retrieval and data mining tasks, such as from user intent understanding, document ranking to advanced recommendations. There are two semantically closed and important constituents when modeling along the time dimension, i.e., entity and event. Time is crucially served as the context for changes driven by happenings and phenomena (events) that related to people, organizations or places (so-called entities) in our social lives. Thus, determining what users expect, or in other words, resolving the uncertainty confounded by temporal changes is a compelling task to support consistent user satisfaction. In this thesis, we address the aforementioned issues and propose temporal models that capture the temporal dynamics of such entities and events to serve for the end tasks. Specifically, we make the following contributions in this thesis: (1) Query recommendation and document ranking in the Web - we address the issues for suggesting entity-centric queries and ranking effectiveness surrounding the happening time period of an associated event. In particular, we propose a multi-criteria optimization framework that facilitates the combination of multiple temporal models to smooth out the abrupt changes when transitioning between event phases for the former and a probabilistic approach for search result diversification of temporally ambiguous queries for the latter. (2) Entity relatedness in Wikipedia - we study the long-term dynamics of Wikipedia as a global memory place for high-impact events, specifically the reviving memories of past events. Additionally, we propose a neural network-based approach to measure the temporal relatedness of entities and events. The model engages different latent representations of an entity (i.e., from time, link-based graph and content) and use the collective attention from user navigation as the supervision. (3) Graph-based ranking and temporal anchor-text mining inWeb Archives - we tackle the problem of discovering important documents along the time-span ofWeb Archives, leveraging the link graph. Specifically, we combine the problems of relevance, temporal authority, diversity and time in a unified framework. The model accounts for the incomplete link structure and natural time lagging in Web Archives in mining the temporal authority. (4) Methods for enhancing predictive models at early-stage in social media and clinical domain - we investigate several methods to control model instability and enrich contexts of predictive models at the “cold-start” period. We demonstrate their effectiveness for the rumor detection and blood glucose prediction cases respectively. Overall, the findings presented in this thesis demonstrate the importance of tracking these temporal dynamics surround salient events and entities for IR applications. We show that determining such changes in time-based patterns and trends in prevalent temporal collections can better satisfy user expectations, and boost ranking and recommendation effectiveness over time
    • …
    corecore