16 research outputs found

    Efficient Diversification of Web Search Results

    Full text link
    In this paper we analyze the efficiency of various search results diversification methods. While the efficacy of diversification approaches has been deeply investigated in the past, response time and scalability issues have rarely been addressed. We therefore propose a unified framework for studying the performance and feasibility of result diversification solutions. First, we define a new methodology for detecting when, and how, query results need to be diversified. For this purpose, we rely on the concept of "query refinement" to estimate the probability that a query is ambiguous. Then, relying on this novel ambiguity detection method, we deploy and compare, on a standard test set, three different diversification methods: IASelect, xQuAD, and OptSelect. While the first two are recent state-of-the-art proposals, the last is an original algorithm introduced in this paper. We evaluate both the efficiency and the effectiveness of our approach against its competitors using the standard TREC Web diversification track testbed. Results show that OptSelect runs two orders of magnitude faster than the other two state-of-the-art approaches while obtaining comparable diversification effectiveness. Comment: VLDB201
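    The greedy trade-off behind many diversification methods can be sketched in a few lines. The following is a minimal, generic MMR-style selection loop, not the paper's OptSelect (whose details are not given here); the relevance scores, similarity function, and the `lam` trade-off parameter are illustrative assumptions.

    ```python
    def diversify(docs, relevance, similarity, k, lam=0.5):
        """Greedily select k documents, trading off relevance vs. redundancy.

        docs       -- list of document ids
        relevance  -- dict: doc id -> relevance score
        similarity -- function (doc, doc) -> similarity in [0, 1]
        lam        -- trade-off: 1.0 = pure relevance, 0.0 = pure diversity
        """
        selected = []
        candidates = set(docs)
        while candidates and len(selected) < k:
            def mmr(d):
                # Penalize a candidate by its closest already-selected document.
                max_sim = max((similarity(d, s) for s in selected), default=0.0)
                return lam * relevance[d] - (1 - lam) * max_sim
            best = max(candidates, key=mmr)
            selected.append(best)
            candidates.remove(best)
        return selected
    ```

    With topic-overlap similarity, a slightly less relevant document on a different topic is selected ahead of a near-duplicate of the top result.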

    The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling

    Full text link
    We study a basic problem of approximating the size of an unknown set S in a known universe U. We consider two versions of the problem. In both versions the algorithm can specify subsets T ⊆ U. In the first version, which we refer to as the group query or subset query version, the algorithm is told whether T ∩ S is non-empty. In the second version, which we refer to as the subset sampling version, if T ∩ S is non-empty, then the algorithm receives a uniformly selected element from T ∩ S. We study the difference between these two versions under different conditions on the subsets that the algorithm may query/sample, and in both the case that the algorithm is adaptive and the case where it is non-adaptive. In particular we focus on a natural family of allowed subsets, which correspond to intervals, as well as variants of this family.
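    To make the interval setting concrete, here is a toy illustration (not the paper's algorithms) of what adaptive group queries on intervals can do in the easiest special case, where S itself happens to be a single interval: two binary searches over non-emptiness queries pin down both endpoints, and hence |S|, with O(log |U|) queries.

    ```python
    def interval_set_size(query, n):
        """Determine |S| exactly when S is a non-empty interval in {0,...,n-1},
        using O(log n) group (non-emptiness) queries on interval subsets.

        query(lo, hi) -- True iff S intersects the interval [lo, hi].
        """
        # Leftmost element: smallest i such that S ∩ [0, i] is non-empty.
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi) // 2
            if query(0, mid):
                hi = mid
            else:
                lo = mid + 1
        left = lo
        # Rightmost element: largest i such that S ∩ [i, n-1] is non-empty.
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if query(mid, n - 1):
                lo = mid
            else:
                hi = mid - 1
        right = lo
        return right - left + 1
    ```

    For general (non-interval) sets S the problem is harder, which is exactly the gap the paper's query/sampling models explore.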

    MWAND: A New Early Termination Algorithm for Fast and Efficient Query Evaluation

    Get PDF
    Modern information systems are very large and maintain huge amounts of data, processing millions of documents and millions of queries at any time. In order to select the most important responses from this amount of data, it is common to apply so-called early termination algorithms. These attempt to extract the top-k documents according to a specified monotone increasing scoring function. The principal idea is to reach and score as few significant documents as possible, thus avoiding fully processing the whole document collection. The WAND algorithm is the state of the art in this area. Although efficient, it lacks effectiveness and precision. In this paper, we propose two contributions. The principal one is a new early termination algorithm based on the WAND approach, which we call MWAND (Modified WAND). It is faster and more precise than the original, as it is able to avoid unnecessary WAND steps. In this work, we integrate a tree structure as an index into WAND and add new levels to query processing. As a second contribution, we define new fine-grained metrics to improve the evaluation of the retrieved information. Experimental results on real datasets show that MWAND is more efficient than the WAND approach
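    For readers unfamiliar with the baseline, the following is a compact sketch of the classic WAND pruning idea (not MWAND): posting lists are kept sorted by docid, each term carries an upper bound on its score contribution, and a "pivot" document is the first one whose accumulated upper bounds could beat the current top-k threshold. All names are illustrative; real implementations use block-wise skipping and tuned cursor movement.

    ```python
    import heapq

    def wand_topk(postings, ubounds, k):
        """WAND-style top-k retrieval with disjunctive (sum-of-scores) semantics.

        postings -- dict: term -> list of (docid, score), sorted by docid
        ubounds  -- dict: term -> upper bound on that term's score contribution
        Returns the top-k (score, docid) pairs, best first.
        """
        cursors = {t: 0 for t in postings}          # current position per term
        heap, threshold = [], 0.0                   # min-heap of current top-k
        while True:
            active = [(postings[t][cursors[t]][0], t)
                      for t in postings if cursors[t] < len(postings[t])]
            if not active:
                break
            active.sort()                           # order terms by current docid
            # Pivot: first docid where accumulated upper bounds exceed threshold.
            acc, pivot = 0.0, None
            for doc, t in active:
                acc += ubounds[t]
                if acc > threshold:
                    pivot = doc
                    break
            if pivot is None:
                break                               # nothing left can beat top-k
            if active[0][0] == pivot:
                # All leading cursors are aligned on the pivot: score it fully.
                score = sum(postings[t][cursors[t]][1]
                            for doc, t in active if doc == pivot)
                for doc, t in active:
                    if doc == pivot:
                        cursors[t] += 1
                if len(heap) < k:
                    heapq.heappush(heap, (score, pivot))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, pivot))
                if len(heap) == k:
                    threshold = heap[0][0]
            else:
                # Skip lagging cursors forward to the pivot without scoring.
                for doc, t in active:
                    if doc >= pivot:
                        break
                    while (cursors[t] < len(postings[t])
                           and postings[t][cursors[t]][0] < pivot):
                        cursors[t] += 1
        return sorted(heap, reverse=True)
    ```

    The skipping branch is where the savings come from: documents whose best possible score cannot exceed the current threshold are never fully evaluated.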

    Sampling for Online Content Analyses: Search Engines and Filter Bubbles

    Get PDF
    The characteristics of the internet make it challenging to draw samples for content analyses of online media. To capture public online communication, researchers frequently – and predominantly – rely on search engines to find relevant content. But this approach involves a series of problems, which are due in part to the specifics of search engine algorithms but also to the tendency to orient studies of public media to mass media. This study analyses the effects of using search engines to collect samples for two content analyses. The results show that online publics display a much greater range of content than classical mass media and that search results differ strongly between searchers. More representative sampling should thus attempt to more closely mirror usage patterns and include a greater variety of the public media that exists beyond the channels of professional mass media coverage

    Estimating corpus size via queries

    Full text link

    Changing the focus: worker-centric optimization in human-in-the-loop computations

    Get PDF
    A myriad of emerging applications from simple to complex ones involve human cognizance in the computation loop. Using the wisdom of human workers, researchers have solved a variety of problems, termed as “micro-tasks” such as, captcha recognition, sentiment analysis, image categorization, query processing, as well as “complex tasks” that are often collaborative, such as, classifying craters on planetary surfaces, discovering new galaxies (Galaxyzoo), performing text translation. The current view of “humans-in-the-loop” tends to see humans as machines, robots, or low-level agents used or exploited in the service of broader computation goals. This dissertation is developed to shift the focus back to humans, and study different data analytics problems, by recognizing characteristics of the human workers, and how to incorporate those in a principled fashion inside the computation loop. The first contribution of this dissertation is to propose an optimization framework and a real world system to personalize worker’s behavior by developing a worker model and using that to better understand and estimate task completion time. The framework judiciously frames questions and solicits worker feedback on those to update the worker model. Next, improving workers skills through peer interaction during collaborative task completion is studied. A suite of optimization problems are identified in that context considering collaborativeness between the members as it plays a major role in peer learning. Finally, “diversified” sequence of work sessions for human workers is designed to improve worker satisfaction and engagement while completing tasks

    Estimating Deep Web Properties by Random Walk

    Get PDF
    The deep web is the part of the World Wide Web that is hidden behind form-like interfaces and can be accessed only through queries. Global properties of a deep web data source, such as the average degree and the population size, need to be estimated because the data is not available in its entirety. When a deep web data source is modelled as a document-term bipartite graph, the estimation can be performed by random walks on this graph. This thesis conducts comparative studies on various random walk sampling methods, including Simple Random Walk (SRW), Rejection Random Walk (RRW), Metropolis-Hastings Random Walk (MHRW) and uniform random sampling. Since the random walks are conducted through queries against searchable interfaces, our study focuses on the overall sampling cost and on estimator performance in terms of bias, variance and RRMSE in this particular setting. From our experiments on Newsgroup data we find that MHRW results in higher variance and RRMSE, especially when the degree distribution follows a power law. On the other hand, RRW performs worse in terms of query cost, as it rejects too many samples. Compared to MHRW and RRW, SRW has low variance and RRMSE. Moreover, SRW outperforms true uniform random sampling when the distribution follows a power law
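    As a minimal illustration of MHRW on an explicit graph (not the thesis's query-driven document-term setting, where each step costs a search query): from node u, propose a uniform random neighbor v and accept the move with probability min(1, deg(u)/deg(v)), which corrects the degree bias of a simple random walk so that the stationary distribution is uniform over nodes. The graph and parameters below are illustrative assumptions.

    ```python
    import random

    def mhrw_sample(neighbors, start, n_samples, burn_in=1000, rng=random):
        """Metropolis-Hastings random walk targeting the uniform distribution.

        neighbors -- dict: node -> list of adjacent nodes
        From node u, propose a uniform neighbor v and accept with probability
        min(1, deg(u)/deg(v)); on rejection, stay at u. Returns the visited
        nodes after the burn-in period.
        """
        u, samples = start, []
        for step in range(burn_in + n_samples):
            v = rng.choice(neighbors[u])
            if rng.random() < min(1.0, len(neighbors[u]) / len(neighbors[v])):
                u = v
            if step >= burn_in:
                samples.append(u)
        return samples
    ```

    On a star graph, a simple random walk spends about half its time at the hub (visit frequency proportional to degree), while the MHRW samples above are close to uniform: the hub appears in roughly a quarter of the samples. The thesis's finding is that this correction is not free: the rejected/lazy steps inflate variance, which is why SRW with a degree-aware estimator can do better in practice.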

    Exploiting links and text structure on the Web : a quantitative approach to improving search quality

    Get PDF
    [no abstract]

    Query Log Mining to Enhance User Experience in Search Engines

    Get PDF
    The Web is the biggest repository of documents humans have ever built, and it keeps growing in size every day. Users rely on Web search engines (WSEs) to find information on the Web. By submitting a textual query expressing their information need, WSE users obtain a list of documents that are highly relevant to the query. WSEs store these huge amounts of user activity in "query logs". Query log mining is the set of techniques for extracting valuable knowledge from query logs. This knowledge offers one of the most effective ways of enhancing the users' search experience. Following this vision, in this thesis we first show that the knowledge extracted from query logs suffers from aging effects, and we propose a solution to this phenomenon. Second, we propose new query recommendation algorithms that overcome the aging problem, and we study new query recommendation techniques for efficiently producing recommendations for rare queries. Finally, we study the problem of diversifying Web search engine results: we define a methodology, based on the knowledge derived from query logs, for detecting when and how query results need to be diversified, and we develop an efficient algorithm for diversifying search results