Entity Ranking on Graphs: Studies on Expert Finding
Today's web search engines try to offer services for finding various kinds of information in addition to simple web pages, such as showing locations or answering simple fact queries. Understanding the association between named entities and documents is one of the key steps towards such semantic search tasks. This paper addresses the ranking of entities and models it in a graph-based relevance propagation framework. In particular, we study the problem of expert finding as an example of an entity ranking task. Entity containment graphs are introduced that represent the relationship between text fragments on the one hand and the entities they contain on the other. The paper shows how these graphs can be used to propagate relevance information from the pre-ranked text fragments to their entities. We use this propagation framework to model existing approaches to expert finding based on the entity's indegree, and extend them by recursive relevance propagation based on a probabilistic random walk over the entity containment graphs. Experiments on the TREC expert search task compare the retrieval performance of the different graph and propagation models.
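The indegree-style propagation described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact model: the data, the even score split, and the `steps` parameter are assumptions for the sake of the example.

```python
def propagate_relevance(doc_scores, contains, steps=1):
    """One-step 'indegree' propagation: each pre-ranked text fragment
    passes its retrieval score, split evenly, to the entities it contains.
    Further steps bounce mass back through the containment graph,
    mimicking a recursive random-walk refinement."""
    entity_scores = {}
    for doc, score in doc_scores.items():
        ents = contains.get(doc, [])
        for e in ents:
            entity_scores[e] = entity_scores.get(e, 0.0) + score / len(ents)
    for _ in range(steps - 1):
        # entities push mass back to their documents, then forward again
        doc_mass = {d: sum(entity_scores.get(e, 0.0) for e in ents)
                    for d, ents in contains.items()}
        entity_scores = {}
        for doc, score in doc_mass.items():
            ents = contains[doc]
            for e in ents:
                entity_scores[e] = entity_scores.get(e, 0.0) + score / len(ents)
    return entity_scores

# Hypothetical toy input: two retrieved fragments and their entities.
docs = {"d1": 0.9, "d2": 0.5}
contains = {"d1": ["alice", "bob"], "d2": ["alice"]}
scores = propagate_relevance(docs, contains, steps=1)
```

Here "alice" accumulates mass from both fragments, so she outranks "bob"; adding steps would refine the ranking recursively, as in the random-walk extension.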
Data Mining in Electronic Commerce
Modern business is rushing toward e-commerce. If the transition is done properly, it enables better management, new services, lower transaction costs, and better customer relations. Success depends on skilled information technologists, among whom are statisticians. This paper focuses on some of the contributions that statisticians are making to help change the business world, especially through the development and application of data mining methods. This is a very large area, and the topics we cover are chosen to avoid overlap with other papers in this special issue, as well as to respect the limitations of our expertise. Inevitably, electronic commerce has raised and is raising fresh research problems in a very wide range of statistical areas, and we try to emphasize those challenges.
Comment: Published at http://dx.doi.org/10.1214/088342306000000204 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
Improved Distortion and Spam Resistance for PageRank
For a directed graph $G$, a ranking function, such as PageRank, provides a way of mapping the nodes of $G$ to non-negative real numbers so that the nodes can be ordered. Brin and Page argued that the stationary distribution $\pi$ of a random walk on $G$ is an effective ranking function for queries on an idealized web graph. However, $\pi$ is not defined for all $G$, and in particular, it is not defined for the real web graph. Thus, they introduced PageRank to approximate $\pi$ for graphs with ergodic random walks while being defined on all graphs.
PageRank is defined as a random walk on a graph, where with probability $1-\epsilon$ a random out-edge is traversed, and with \emph{reset probability} $\epsilon$ the random walk instead restarts at a node selected using a \emph{reset vector}. Originally, the reset vector was taken to be uniform on the nodes, and we call this version UPR.
In this paper, we introduce graph-theoretic notions of quality for ranking functions, specifically \emph{distortion} and \emph{spam resistance}. We show that UPR has high distortion and low spam resistance, and we show how to select a reset vector that yields low distortion and high spam resistance.
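The PageRank variant described above, with a reset probability and a general (not necessarily uniform) reset vector, can be sketched by power iteration. This is a standard textbook formulation, not the paper's own code; the dangling-node handling (restarting via the reset vector) is one common convention.

```python
def pagerank(out_edges, reset_vector, eps=0.15, iters=100):
    """Power iteration for PageRank: with probability 1-eps follow a
    random out-edge; with reset probability eps restart at a node drawn
    from reset_vector. Dangling nodes restart via the reset vector too."""
    nodes = list(reset_vector)
    pr = dict(reset_vector)  # start from the reset distribution
    for _ in range(iters):
        nxt = {u: eps * reset_vector[u] for u in nodes}
        for u in nodes:
            outs = out_edges.get(u, [])
            if outs:
                share = (1 - eps) * pr[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                # dangling node: redistribute its mass via the reset vector
                for v in nodes:
                    nxt[v] += (1 - eps) * pr[u] * reset_vector[v]
        pr = nxt
    return pr

# Uniform reset vector gives the UPR version discussed in the abstract.
edges = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
pr = pagerank(edges, {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3})
```

Choosing a non-uniform `reset_vector` is exactly the lever the paper analyzes: the same iteration, with reset mass concentrated on trusted nodes, changes distortion and spam resistance.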
BlogForever D2.4: Weblog spider prototype and associated methodology
The purpose of this document is to present the evaluation of different solutions for capturing blogs and the established methodology, and to describe the developed blog spider prototype.
PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam
Spam web pages have become a problem for Information Retrieval systems due to the negative effects that this phenomenon can cause in their results. In this work we tackle the problem of detecting these pages with a propagation algorithm that, taking a web graph as input, chooses a set of spam and non-spam web pages in order to spread their spam likelihood over the rest of the network. Thus we take advantage of the links between pages to obtain a ranking of pages according to their relevance and their spam likelihood. Our intuition is to give a high reputation to pages related to relevant ones, and a high spam likelihood to pages linked to spam web pages. We introduce the novelty of including the content of the web pages in the computation of an a priori estimation of their spam likelihood, and propagate this information. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. The experimental results show that our method outperforms other techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
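The two-score propagation can be illustrated with a small sketch. This is a hypothetical variant of the idea, not the PolaritySpam algorithm itself: the update rule, the damping factor `alpha`, and the toy graph are assumptions. Reputation flows forward along links from good seeds, while spam likelihood flows backward onto pages that link to spam.

```python
def polarity_propagate(links, good_prior, spam_prior, alpha=0.85, iters=20):
    """Compute two scores per node over a web graph: a reputation score
    propagated along out-links from good pages, and a spam score assigned
    to pages that link to spam-like pages. Priors play the role of the
    content-based a priori estimations described in the abstract."""
    nodes = set(good_prior) | set(spam_prior) | set(links)
    for outs in links.values():
        nodes |= set(outs)
    good = {v: good_prior.get(v, 0.0) for v in nodes}
    spam = {v: spam_prior.get(v, 0.0) for v in nodes}
    rev = {}  # reverse adjacency: who links to v
    for u, outs in links.items():
        for v in outs:
            rev.setdefault(v, []).append(u)
    for _ in range(iters):
        new_good, new_spam = {}, {}
        for v in nodes:
            # reputation: prior plus shares from good pages linking to v
            inc = sum(good[u] / len(links[u]) for u in rev.get(v, []))
            new_good[v] = (1 - alpha) * good_prior.get(v, 0.0) + alpha * inc
            # spam likelihood: prior plus average spam of the pages v links to
            outs = links.get(v, [])
            out_inc = sum(spam[w] for w in outs) / len(outs) if outs else 0.0
            new_spam[v] = (1 - alpha) * spam_prior.get(v, 0.0) + alpha * out_inc
        good, spam = new_good, new_spam
    return good, spam

# Toy graph: g is a trusted seed linking to h; p links to spam seed s.
links = {"g": ["h"], "p": ["s"]}
good, spam = polarity_propagate(links, {"g": 1.0}, {"s": 1.0})
```

After propagation, `h` inherits reputation from `g`, while `p` inherits spam likelihood for linking to `s`, matching the intuition stated in the abstract.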