16 research outputs found

    Efficient Diversification of Web Search Results

    Full text link
    In this paper we analyze the efficiency of various search results diversification methods. While the efficacy of diversification approaches has been deeply investigated in the past, response time and scalability issues have rarely been addressed. We therefore propose a unified framework for studying the performance and feasibility of result diversification solutions. First, we define a new methodology for detecting when, and how, query results need to be diversified. For this purpose, we rely on the concept of "query refinement" to estimate the probability that a query is ambiguous. Then, relying on this novel ambiguity detection method, we deploy and compare, on a standard test set, three different diversification methods: IASelect, xQuAD, and OptSelect. While the first two are recent state-of-the-art proposals, the last is an original algorithm introduced in this paper. We evaluate both the efficiency and the effectiveness of our approach against its competitors using the standard TREC Web diversification track testbed. Results show that OptSelect runs two orders of magnitude faster than the other two state-of-the-art approaches while obtaining comparable diversification effectiveness. Comment: VLDB201
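    The greedy trade-off behind many diversification methods can be sketched in a few lines. The following is a minimal, generic MMR-style selection loop, not the paper's OptSelect (whose details are not given here); the relevance scores, similarity function, and the `lam` trade-off parameter are illustrative assumptions.

    ```python
    def diversify(docs, relevance, similarity, k, lam=0.5):
        """Greedily select k documents, trading off relevance vs. redundancy.

        docs       -- list of document ids
        relevance  -- dict: doc id -> relevance score
        similarity -- function (doc, doc) -> similarity in [0, 1]
        lam        -- trade-off: 1.0 = pure relevance, 0.0 = pure diversity
        """
        selected = []
        candidates = set(docs)
        while candidates and len(selected) < k:
            def mmr(d):
                # Penalize a candidate by its closest already-selected document.
                max_sim = max((similarity(d, s) for s in selected), default=0.0)
                return lam * relevance[d] - (1 - lam) * max_sim
            best = max(candidates, key=mmr)
            selected.append(best)
            candidates.remove(best)
        return selected
    ```

    With topic-overlap similarity, a slightly less relevant document on a different topic is selected ahead of a near-duplicate of the top result.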

    The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling

    Full text link
    We study a basic problem of approximating the size of an unknown set S in a known universe U. We consider two versions of the problem. In both versions the algorithm can specify subsets T ⊆ U. In the first version, which we refer to as the group query or subset query version, the algorithm is told whether T ∩ S is non-empty. In the second version, which we refer to as the subset sampling version, if T ∩ S is non-empty, then the algorithm receives a uniformly selected element from T ∩ S. We study the difference between these two versions under different conditions on the subsets that the algorithm may query/sample, and in both the case that the algorithm is adaptive and the case where it is non-adaptive. In particular we focus on a natural family of allowed subsets, which correspond to intervals, as well as variants of this family.
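    To make the interval setting concrete, here is a toy illustration (not the paper's algorithms) of what adaptive group queries on intervals can do in the easiest special case, where S itself happens to be a single interval: two binary searches over non-emptiness queries pin down both endpoints, and hence |S|, with O(log |U|) queries.

    ```python
    def interval_set_size(query, n):
        """Determine |S| exactly when S is a non-empty interval in {0,...,n-1},
        using O(log n) group (non-emptiness) queries on interval subsets.

        query(lo, hi) -- True iff S intersects the interval [lo, hi].
        """
        # Leftmost element: smallest i such that S ∩ [0, i] is non-empty.
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if query(0, mid):
                hi = mid
            else:
                lo = mid + 1
        left = lo
        # Rightmost element: largest i such that S ∩ [i, n-1] is non-empty.
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if query(mid, n - 1):
                lo = mid
            else:
                hi = mid - 1
        right = lo
        return right - left + 1
    ```

    For general (non-interval) sets S the problem is harder, which is exactly the gap the paper's query/sampling models explore.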

    MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

    Get PDF
    Modern information systems are very large and maintain huge amounts of data, processing millions of documents and millions of queries at any time. In order to select the most important responses from this amount of data, it is common to apply so-called early termination algorithms. These attempt to extract the top-k documents according to a specified monotone increasing scoring function. The principal idea is to reach and score as few significant documents as possible, thus avoiding fully processing the whole document collection. The WAND algorithm is the state of the art in this area. Although efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal one is a new early termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than the original, as it is able to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and add new levels to query processing. As a second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. Experimental results on real datasets show that MWAND is more efficient than the WAND approach
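    For readers unfamiliar with the baseline, the following is a compact sketch of the classic WAND pruning idea (not MWAND): posting lists are kept sorted by docid, each term carries an upper bound on its score contribution, and a "pivot" document is the first one whose accumulated upper bounds could beat the current top-k threshold. All names are illustrative; real implementations use block-wise skipping and tuned cursor movement.

    ```python
    import heapq

    def wand_topk(postings, ubounds, k):
        """WAND-style top-k retrieval with disjunctive (sum-of-scores) semantics.

        postings -- dict: term -> list of (docid, score), sorted by docid
        ubounds  -- dict: term -> upper bound on that term's score contribution
        Returns the top-k (score, docid) pairs, best first.
        """
        cursors = {t: 0 for t in postings}          # current position per term
        heap, threshold = [], 0.0                   # min-heap of current top-k
        while True:
            active = [(postings[t][cursors[t]][0], t)
                      for t in postings if cursors[t] < len(postings[t])]
            if not active:
                break
            active.sort()                           # order terms by current docid
            # Pivot: first docid where accumulated upper bounds exceed threshold.
            acc, pivot = 0.0, None
            for doc, t in active:
                acc += ubounds[t]
                if acc > threshold:
                    pivot = doc
                    break
            if pivot is None:
                break                               # nothing left can beat top-k
            if active[0][0] == pivot:
                # All leading cursors are aligned on the pivot: score it fully.
                score = sum(postings[t][cursors[t]][1]
                            for doc, t in active if doc == pivot)
                for doc, t in active:
                    if doc == pivot:
                        cursors[t] += 1
                if len(heap) < k:
                    heapq.heappush(heap, (score, pivot))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, pivot))
                if len(heap) == k:
                    threshold = heap[0][0]
            else:
                # Skip lagging cursors forward to the pivot without scoring.
                for doc, t in active:
                    if doc >= pivot:
                        break
                    while (cursors[t] < len(postings[t])
                           and postings[t][cursors[t]][0] < pivot):
                        cursors[t] += 1
        return sorted(heap, reverse=True)
    ```

    The skipping branch is where the savings come from: documents whose best possible score cannot exceed the current threshold are never fully evaluated.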

    Sampling for Online Content Analyses: Search Engines and Filter Bubbles

    Get PDF
    The characteristics of the internet make it challenging to draw samples for content analyses of online media. To capture public online communication, researchers frequently – and predominantly – rely on search engines to find relevant content. But this approach involves a series of problems, which are due in part to the specifics of search engine algorithms but also to the tendency to orient studies of public media to mass media. This study analyses the effects of using search engines to collect samples for two content analyses. The results show that online publics display a much greater range of content than classical mass media and that search results differ strongly between searchers. More representative sampling should thus attempt to more closely mirror usage patterns and include a greater variety of the public media that exists beyond the channels of professional mass media coverage

    Estimating corpus size via queries

    Full text link

    Changing the focus: worker-centric optimization in human-in-the-loop computations

    Get PDF
    A myriad of emerging applications from simple to complex ones involve human cognizance in the computation loop. Using the wisdom of human workers, researchers have solved a variety of problems, termed as “micro-tasks” such as, captcha recognition, sentiment analysis, image categorization, query processing, as well as “complex tasks” that are often collaborative, such as, classifying craters on planetary surfaces, discovering new galaxies (Galaxyzoo), performing text translation. The current view of “humans-in-the-loop” tends to see humans as machines, robots, or low-level agents used or exploited in the service of broader computation goals. This dissertation is developed to shift the focus back to humans, and study different data analytics problems, by recognizing characteristics of the human workers, and how to incorporate those in a principled fashion inside the computation loop. The first contribution of this dissertation is to propose an optimization framework and a real world system to personalize worker’s behavior by developing a worker model and using that to better understand and estimate task completion time. The framework judiciously frames questions and solicits worker feedback on those to update the worker model. Next, improving workers skills through peer interaction during collaborative task completion is studied. A suite of optimization problems are identified in that context considering collaborativeness between the members as it plays a major role in peer learning. Finally, “diversified” sequence of work sessions for human workers is designed to improve worker satisfaction and engagement while completing tasks

    Estimating Deep Web Properties by Random Walk

    Get PDF
    The deep web is the part of the World Wide Web that is hidden behind form-like interfaces and can be accessed only through queries. Global properties of a deep web data source, such as the average degree and the population size, need to be estimated because the data is not available in its entirety. When a deep web data source is modelled as a document-term bipartite graph, the estimation can be performed by random walks on this graph. This thesis conducts comparative studies on various random walk sampling methods, including Simple Random Walk (SRW), Rejection Random Walk (RRW), Metropolis-Hastings Random Walk (MHRW) and uniform random sampling. Since the random walks are conducted through queries against searchable interfaces, our study focuses on the overall sampling cost and on estimator performance in terms of bias, variance and RRMSE in this particular setting. From our experiments on Newsgroup data we find that MHRW results in higher variance and RRMSE, especially when the degree distribution follows a power law. On the other hand, RRW performs worse in terms of query cost, as it rejects too many samples. Compared to MHRW and RRW, SRW has low variance and RRMSE. Moreover, SRW outperforms true uniform random sampling when the distribution follows a power law
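    As a minimal illustration of MHRW on an explicit graph (not the thesis's query-driven document-term setting, where each step costs a search query): from node u, propose a uniform random neighbor v and accept the move with probability min(1, deg(u)/deg(v)), which corrects the degree bias of a simple random walk so that the stationary distribution is uniform over nodes. The graph and parameters below are illustrative assumptions.

    ```python
    import random

    def mhrw_sample(neighbors, start, n_samples, burn_in=1000, rng=random):
        """Metropolis-Hastings random walk targeting the uniform distribution.

        neighbors -- dict: node -> list of adjacent nodes
        From node u, propose a uniform neighbor v and accept with probability
        min(1, deg(u)/deg(v)); on rejection, stay at u. Returns the visited
        nodes after the burn-in period.
        """
        u, samples = start, []
        for step in range(burn_in + n_samples):
            v = rng.choice(neighbors[u])
            if rng.random() < min(1.0, len(neighbors[u]) / len(neighbors[v])):
                u = v
            if step >= burn_in:
                samples.append(u)
        return samples
    ```

    On a star graph, a simple random walk spends about half its time at the hub (visit frequency proportional to degree), while the MHRW samples above are close to uniform: the hub appears in roughly a quarter of the samples. The thesis's finding is that this correction is not free: the rejected/lazy steps inflate variance, which is why SRW with a degree-aware estimator can do better in practice.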

    Exploiting links and text structure on the Web : a quantitative approach to improving search quality

    Get PDF
    [no abstract]

    Query Log Mining to Enhance User Experience in Search Engines

    Get PDF
    The Web is the biggest repository of documents humans have ever built, and it keeps growing in size every day. Users rely on Web search engines (WSEs) to find information on the Web. By submitting a textual query expressing their information need, WSE users obtain a list of documents that are highly relevant to the query. WSEs store these huge amounts of user activity in "query logs". Query log mining is the set of techniques for extracting valuable knowledge from query logs. This knowledge offers one of the most effective ways of enhancing the users' search experience. Following this vision, in this thesis we first show that the knowledge extracted from query logs suffers from aging effects, and we propose a solution to this phenomenon. Second, we propose new query recommendation algorithms that overcome the aging problem, and we study new query recommendation techniques for efficiently producing recommendations for rare queries. Finally, we study the problem of diversifying Web search engine results: we define a methodology, based on the knowledge derived from query logs, for detecting when and how query results need to be diversified, and we develop an efficient algorithm for diversifying search results