Distributed Information Retrieval using Keyword Auctions
This report motivates the need for large-scale distributed approaches to information retrieval and proposes solutions based on keyword auctions.
PageRank optimization applied to spam detection
We give a new link spam detection and PageRank demotion algorithm called
MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked
trusted and spam pages. We define the MaxRank of a page as the frequency with
which a random surfer visits this page while minimizing an average cost per time unit.
On a given page, the random surfer selects a set of hyperlinks and clicks with
uniform probability on any of these hyperlinks. The cost function penalizes
spam pages and hyperlink removals. The goal is to determine a hyperlink
deletion policy that minimizes this score. The MaxRank is interpreted as a
modified PageRank vector, used to sort web pages instead of the usual PageRank
vector. The bias vector of this ergodic control problem, which is unique up to
an additive constant, is a measure of the "spamicity" of each page, used to
detect spam pages. We give a scalable algorithm for MaxRank computation that
allowed us to run experiments on the WEBSPAM-UK2007 dataset. We
show that our algorithm outperforms both TrustRank and AntiTrustRank for spam
and nonspam page detection.
Comment: 8 pages, 6 figures
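The random-surfer model in the abstract above can be illustrated with a small sketch. This is not the MaxRank algorithm: it is ordinary PageRank-style power iteration, and the damping factor, the toy graph, and the function name `visit_frequencies` are all assumptions made for illustration. It only shows how removing a hyperlink to a spam page lowers that page's visit frequency.

```python
def visit_frequencies(links, damping=0.85, iters=200):
    """Visit frequencies of a surfer who clicks uniformly on the kept links.

    `links` maps each page to the list of pages it links to.
    The damping/teleport step is the usual PageRank assumption,
    not something stated in the abstract.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page receives the uniform teleport mass first.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page with no kept links jumps uniformly to any page.
            targets = links[p] or pages
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    return rank


# Hypothetical three-page graph; "spam" is a hand-labelled spam page.
graph = {"a": ["b", "spam"], "b": ["a"], "spam": ["a"]}
# Same graph after deleting the hyperlink a -> spam.
pruned = {"a": ["b"], "b": ["a"], "spam": ["a"]}

before = visit_frequencies(graph)
after = visit_frequencies(pruned)
```

After pruning, the spam page is reached only through the teleport step, so its visit frequency drops. MaxRank, as described above, goes further by choosing which links to delete so as to minimize an average cost.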
Cross-Paced Representation Learning with Partial Curricula for Sketch-based Image Retrieval
In this paper we address the problem of learning robust cross-domain
representations for sketch-based image retrieval (SBIR). While most SBIR
approaches focus on extracting low- and mid-level descriptors for direct
feature matching, recent works have shown the benefit of learning coupled
feature representations to describe data from two related sources. However,
cross-domain representation learning methods are typically cast into non-convex
minimization problems that are difficult to optimize, leading to unsatisfactory
performance. Inspired by self-paced learning, a learning methodology designed
to overcome convergence issues related to local optima by exploiting the
samples in a meaningful order (i.e. easy to hard), we introduce the cross-paced
partial curriculum learning (CPPCL) framework. Compared with existing
self-paced learning methods which only consider a single modality and cannot
deal with prior knowledge, CPPCL is specifically designed to assess the
learning pace by jointly handling data from dual sources and modality-specific
prior information provided in the form of partial curricula. Additionally,
thanks to the learned dictionaries, we demonstrate that the proposed CPPCL
embeds robust coupled representations for SBIR. Our approach is extensively
evaluated on four publicly available datasets (i.e. CUFS, Flickr15K, QueenMary
SBIR and TU-Berlin Extension datasets), showing superior performance over
competing SBIR methods.
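The easy-to-hard ordering mentioned in the abstract above can be sketched with the classic single-modality self-paced weighting rule, in which a sample is included once its current loss falls below a pace threshold. This is not CPPCL, which handles two modalities and partial curricula. The loss values, thresholds, and the name `self_paced_weights` are hypothetical.

```python
def self_paced_weights(losses, threshold):
    """Binary self-paced weights: keep a sample only if its loss is below the threshold."""
    return [1 if loss < threshold else 0 for loss in losses]


# Hypothetical per-sample losses under the current model.
losses = [0.1, 0.9, 0.4, 2.0]

# Raising the threshold over training rounds admits harder samples.
schedule = [self_paced_weights(losses, lam) for lam in (0.5, 1.0, 2.5)]
```

In a full training loop the model would be refit on the selected samples after each round and the losses recomputed before the threshold is raised.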
Cost-aware caching: optimizing cache provisioning and object placement in ICN
Caching is frequently used by Internet Service Providers as a viable
technique to reduce the latency perceived by end users, while jointly
offloading network traffic. While the cache hit-ratio is generally considered
in the literature as the dominant performance metric for this type of system,
we argue that a critical piece has so far been neglected.
Adopting a radically different perspective, in this paper we explicitly account
for the cost of content retrieval, i.e. the cost associated with the external
bandwidth needed by an ISP to retrieve the contents requested by its customers.
Interestingly, we discover that classical cache provisioning techniques that
maximize cache efficiency (i.e., the hit-ratio) lead to suboptimal solutions
with higher overall cost. To show this mismatch, we propose two optimization
models that either minimize the overall costs or maximize the hit-ratio,
jointly providing cache sizing, object placement and path selection. We
formulate a polynomial-time greedy algorithm to solve the two problems and
analytically prove its optimality. We provide numerical results and show that
significant cost savings are attainable via a cost-aware design.
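The mismatch between hit-ratio and retrieval cost described in the abstract above can be seen in a toy example. This is a sketch under strong assumptions (a single cache, unit-size objects, made-up request rates and per-miss costs), not the authors' optimization models, which also cover cache sizing and path selection. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    rate: float  # requests per unit time (hypothetical)
    cost: float  # external bandwidth cost per miss (hypothetical)


def place(objects, capacity, key):
    """Greedily cache the `capacity` objects that score highest under `key`."""
    ranked = sorted(objects, key=key, reverse=True)
    return {o.name for o in ranked[:capacity]}


def hit_ratio(objects, cached):
    """Fraction of requests served from the cache."""
    total = sum(o.rate for o in objects)
    return sum(o.rate for o in objects if o.name in cached) / total


def retrieval_cost(objects, cached):
    """External cost paid for every request that misses the cache."""
    return sum(o.rate * o.cost for o in objects if o.name not in cached)


catalog = [Obj("x", 10, 1), Obj("y", 8, 1), Obj("z", 6, 5)]

# Hit-ratio-driven placement keeps the most popular object.
by_hits = place(catalog, 1, key=lambda o: o.rate)
# Cost-aware placement keeps the object whose misses are most expensive overall.
by_cost = place(catalog, 1, key=lambda o: o.rate * o.cost)
```

Here the popularity-based placement caches `x` and achieves the higher hit-ratio, yet it pays more for external retrieval than the cost-aware placement, which caches the less popular but expensive `z`.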