Delphic Costs and Benefits in Web Search: A utilitarian and historical analysis
We present a new framework to conceptualize and operationalize the total user
experience of search, by studying the entirety of a search journey from a
utilitarian point of view.
Web search engines are widely perceived as "free". But search requires time
and effort: in reality there are many intermingled non-monetary costs (e.g.
time costs, cognitive costs, interactivity costs) and the benefits may be
marred by various impairments, such as misunderstanding and misinformation.
This characterization of costs and benefits appears to be inherent to the human
search for information within the pursuit of some larger task: most of the
costs and impairments can be identified in interactions with any web search
engine, interactions with public libraries, and even in interactions with
ancient oracles. To emphasize this innate connection, we call these costs and
benefits Delphic, in contrast to explicitly financial costs and benefits.
Our main thesis is that the users' satisfaction with a search engine mostly
depends on their experience of Delphic costs and benefits, in other words, on
their utility. This consumer utility is correlated with classic measures of
search engine quality, such as ranking, precision, recall, etc., but is not
completely determined by them. To argue our thesis, we catalog the Delphic
costs and benefits and show how the development of search engines over the last
quarter century, from classic Information Retrieval roots to the integration of
Large Language Models, was driven to a great extent by the quest to decrease
Delphic costs and increase Delphic benefits.
We hope that the Delphic costs framework will engender new ideas and new
research for evaluating and improving the web experience for everyone.
Comment: 10 pages.
On-line load balancing
The setup for our problem consists of n servers that must complete a set of tasks. Each task can be handled only by a subset of the servers, requires a different level of service, and once assigned cannot be reassigned. We make the natural assumption that the level of service is known at arrival time, but that the duration of service is not. The on-line load balancing problem is to assign each task to an appropriate server in such a way that the maximum load on the servers is minimized. In this paper we derive matching upper and lower bounds for the competitive ratio of the on-line greedy algorithm for this problem, namely ((3n)^{2/3}/2)(1 + o(1)), and derive a lower bound, Ω(n^{1/2}), for any other deterministic or randomized on-line algorithm.
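The greedy rule analyzed above is simple enough to sketch directly: each arriving task goes to the currently least-loaded server among those eligible for it, and the assignment is never revisited. The sketch below is illustrative; the function name `greedy_assign` and the toy instance are assumptions, not taken from the paper.

```python
def greedy_assign(tasks, n_servers):
    """On-line greedy load balancing: send each task to the least-loaded
    server among the servers eligible to run it.

    tasks: sequence of (eligible_servers, weight) pairs, revealed one at
    a time; weight is the known level of service.
    Returns the final per-server loads.
    """
    load = [0.0] * n_servers
    for eligible, weight in tasks:
        # Pick the currently least-loaded eligible server
        # (ties broken by server index).
        target = min(eligible, key=lambda s: load[s])
        load[target] += weight  # assignment is permanent: no reassignment
    return load

# Hypothetical toy instance: 3 servers, tasks with restricted eligibility.
tasks = [([0, 1], 2.0), ([1, 2], 1.0), ([0], 3.0), ([0, 1, 2], 1.0)]
print(greedy_assign(tasks, 3))  # → [5.0, 1.0, 1.0]
```

The third task may only run on server 0, which is why greedy can end up with an unbalanced maximum load; the paper quantifies exactly how bad this can get in the worst case.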
Nobody cares if you liked Star Wars: KNN graph construction on the cheap
K-Nearest-Neighbors (KNN) graphs play a key role in a large range of applications. A KNN graph typically connects entities characterized by a set of features so that each entity becomes linked to its k most similar counterparts according to some similarity function. As datasets grow, KNN graphs are unfortunately becoming increasingly costly to construct, and the general approach, which consists in reducing the number of comparisons between entities, seems to have reached its full potential. In this paper we propose to overcome this limit with a simple yet powerful strategy that samples the set of features of each entity and only keeps the least popular features. We show that this strategy outperforms other more straightforward policies on a range of four representative datasets: for instance, keeping the 25 least popular items reduces computational time by up to 63%, while producing a KNN graph close to the ideal one.
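The sampling strategy described above is easy to sketch: count each feature's global popularity, keep only the least popular features per entity, then build the KNN graph on the shrunken profiles. This is a minimal illustration, not the paper's implementation; the names `sample_least_popular` and `keep`, and the brute-force graph loop, are assumptions.

```python
from collections import Counter

def sample_least_popular(profiles, keep):
    """Keep, for each entity, only its `keep` least globally popular
    features (widely shared items like Star Wars carry little signal)."""
    popularity = Counter(f for feats in profiles.values() for f in feats)
    return {e: set(sorted(feats, key=lambda f: (popularity[f], f))[:keep])
            for e, feats in profiles.items()}

def knn_graph(profiles, k):
    """Brute-force KNN graph under Jaccard similarity; the paper's gain
    comes from cheaper comparisons after sampling, not from this loop."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    graph = {}
    for e, feats in profiles.items():
        sims = [(jaccard(feats, other), o)
                for o, other in profiles.items() if o != e]
        graph[e] = [o for _, o in sorted(sims, reverse=True)[:k]]
    return graph

# Hypothetical profiles: everyone liked Star Wars, so it is dropped first.
profiles = {
    "alice": {"star_wars", "eraserhead", "stalker"},
    "bob":   {"star_wars", "eraserhead", "primer"},
    "carol": {"star_wars", "titanic", "avatar"},
}
sampled = sample_least_popular(profiles, keep=2)
print(knn_graph(sampled, k=1)["alice"])  # → ['bob']
```

After sampling, alice and bob still share the rare feature "eraserhead", while the universally popular "star_wars" no longer inflates everyone's similarity to everyone else.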
Fair Near Neighbor Search: Independent Range Sampling in High Dimensions. PODS
Similarity search is a fundamental algorithmic primitive, widely used in many
computer science disciplines. There are several variants of the similarity
search problem, and one of the most relevant is the r-near neighbor (r-NN)
problem: given a radius r and a set of points S, construct a data
structure that, for any given query point q, returns a point p within
distance at most r from q. In this paper, we study the r-NN problem in
the light of fairness. We consider fairness in the sense of equal opportunity:
all points that are within distance r from the query should have the same
probability to be returned. In the low-dimensional case, this problem was first
studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the
theoretically strongest approach to similarity search in high dimensions, does
not provide such a fairness guarantee. To address this, we propose efficient
data structures for r-NN where all points in S that are near q have the
same probability to be selected and returned by the query. Specifically, we
first propose a black-box approach that, given any LSH scheme, constructs a
data structure for uniformly sampling points in the neighborhood of a query.
Then, we develop a data structure for fair similarity search under inner
product that requires nearly-linear space and exploits locality sensitive
filters. The paper concludes with an experimental evaluation that highlights
(un)fairness in a recommendation setting on real-world datasets and discusses
the inherent unfairness introduced by solving other variants of the problem.
Comment: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on
Principles of Database Systems (PODS), pages 191-204, June 2020.
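The unfairness of plain LSH comes from the fact that a point colliding with the query in many hash tables is proportionally more likely to be reported. One standard way to restore uniformity, in the spirit of the black-box approach above, is rejection sampling: draw a candidate from a random table, then accept it with probability inversely proportional to the number of tables in which it collides with the query. The sketch below is illustrative only; the name `fair_sample_near` and the explicit degree computation are assumptions, not the paper's exact data structure.

```python
import random

def fair_sample_near(q, tables, hash_fns, r, dist, rng):
    """Rejection-sampling sketch of fair r-NN over L LSH tables.

    A near point p that collides with q in deg(p) tables is deg(p) times
    likelier to be drawn, so accepting with probability 1/deg(p) makes
    every near point equally likely conditioned on acceptance.
    """
    L = len(tables)
    while True:  # caution: loops forever if no near point collides with q
        i = rng.randrange(L)
        bucket = tables[i].get(hash_fns[i](q), [])
        if not bucket:
            continue
        p = bucket[rng.randrange(len(bucket))]
        if dist(p, q) > r:
            continue  # a far point that happened to collide: reject it
        # deg(p): number of tables in which p collides with q.
        deg = sum(hash_fns[j](p) == hash_fns[j](q) for j in range(L))
        if rng.random() < 1.0 / deg:
            return p

# Hypothetical 1-D demo with a single bucketing hash function.
h = lambda x: int(x // 1.0)
points = [0.0, 0.1, 0.2, 5.0]
table = {}
for p in points:
    table.setdefault(h(p), []).append(p)
rng = random.Random(0)
print(fair_sample_near(0.05, [table], [h], 0.2,
                       lambda a, b: abs(a - b), rng))
```

With a single table, every point in the query's bucket collides exactly once, so the sampler reduces to a uniform draw over the bucket's near points; the degree correction only matters once several tables are combined.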
Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018
Distance-Sensitive Hashing
Locality-sensitive hashing (LSH) is an important tool for managing
high-dimensional noisy or uncertain data, for example in connection with data
cleaning (similarity join) and noise-robust search (similarity search).
However, for a number of problems the LSH framework is not known to yield good
solutions, and instead ad hoc solutions have been designed for particular
similarity and distance measures. For example, this is true for
output-sensitive similarity search/join, and for indexes supporting annulus
queries that aim to report a point close to a certain given distance from the
query point.
In this paper we initiate the study of distance-sensitive hashing (DSH), a
generalization of LSH that seeks a family of hash functions such that the
probability of two points having the same hash value is a given function of the
distance between them. More precisely, given a distance space and a
"collision probability function" (CPF) f, we seek a distribution over pairs
of functions (h, g) such that for every pair of points x, y the collision
probability is Pr[h(x) = g(y)] = f(dist(x, y)). Locality-sensitive
hashing is the study of how fast a CPF can decrease as the distance grows. For
many spaces, f can be made exponentially decreasing even if we restrict
attention to the symmetric case where h = g. We show that the asymmetry
achieved by having a pair of functions makes it possible to achieve CPFs that
are, for example, increasing or unimodal, and show how this leads to principled
solutions to problems not addressed by the LSH framework. This includes a novel
application to privacy-preserving distance estimation. We believe that the DSH
framework will find further applications in high-dimensional data management.
Comment: Accepted at PODS'18. Abstract shortened due to character limit.
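A tiny example makes the power of asymmetry concrete. On the Hamming cube {0,1}^d, take h to read a random bit of x and g to read the *negation* of the same bit of y: the pair collides exactly when that bit differs, giving the increasing CPF f(t) = t/d, which no symmetric choice h = g can achieve. This is a minimal sketch in the spirit of the DSH framework, not code from the paper; the function names are assumptions.

```python
import random

def make_dsh_pair(d, rng):
    """Draw one (h, g) pair from a toy distance-sensitive family on
    {0,1}^d: h reads a random bit of x, g reads the negated same bit
    of y, so they collide exactly when that bit differs."""
    i = rng.randrange(d)
    return (lambda x: x[i]), (lambda y: 1 - y[i])

def dsh_collision_prob(x, y):
    """Exact collision probability of the family above, obtained by
    averaging over all d index choices: Pr[h(x) = g(y)] = hamming(x, y) / d,
    an *increasing* CPF -- the opposite of LSH."""
    d = len(x)
    return sum(xi == 1 - yi for xi, yi in zip(x, y)) / d

x = [0, 0, 0, 0]
print(dsh_collision_prob(x, [0, 0, 0, 0]))  # → 0.0  (identical points never collide)
print(dsh_collision_prob(x, [1, 1, 0, 0]))  # → 0.5
print(dsh_collision_prob(x, [1, 1, 1, 1]))  # → 1.0  (antipodal points always collide)
```

Such an increasing CPF is exactly what an annulus-style query wants: points far from the query collide most often, so the hash family concentrates candidates away from the query rather than near it.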