151 research outputs found
Efficient Optimally Lazy Algorithms for Minimal-Interval Semantics
Minimal-interval semantics associates with each query over a document a set
of intervals, called witnesses, that are incomparable with respect to inclusion
(i.e., they form an antichain): witnesses define the minimal regions of the
document satisfying the query. Minimal-interval semantics makes it easy to
define and compute several sophisticated proximity operators, provides snippets
for user presentation, and can be used to rank documents. In this paper we
provide algorithms for computing conjunction and disjunction that are linear in
the number of intervals and logarithmic in the number of operands; for
additional operators, such as ordered conjunction and Brouwerian difference, we
provide linear algorithms. In all cases, space is linear in the number of
operands. More importantly, we define a formal notion of optimal laziness, and
either prove it, or prove its impossibility, for each algorithm. We cast our
results in a general framework of antichains of intervals on total orders,
making our algorithms directly applicable to other domains.
Comment: 24 pages, 4 figures. A preliminary (now outdated) version was
presented at SPIRE 200
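The antichain condition in the abstract above can be illustrated with a minimal sketch. It is not one of the paper's lazy algorithms: it only reduces an arbitrary list of intervals to its minimal elements under inclusion. The function name `minimal_intervals` and the closed-interval `(left, right)` tuple representation are assumptions made for this example.

```python
def minimal_intervals(intervals):
    """Return the intervals that contain no other interval of the input.

    Intervals are (left, right) tuples with left <= right (an assumed
    representation). The result is an antichain with respect to inclusion.
    """
    # Sort by right endpoint ascending; on ties, larger left endpoint first,
    # so the shortest interval with a given right endpoint is seen first.
    ordered = sorted(intervals, key=lambda iv: (iv[1], -iv[0]))
    result = []
    max_left = None
    for left, right in ordered:
        # Every earlier interval ends at or before `right`, so the current
        # interval contains one of them exactly when its left endpoint is
        # not strictly greater than the largest left endpoint seen so far.
        if max_left is None or left > max_left:
            result.append((left, right))
            max_left = left
    return result


example = minimal_intervals([(1, 5), (2, 3), (4, 8), (2, 6), (4, 8), (6, 7)])
```

Sorting dominates the cost, so this sketch runs in O(n log n) time for n input intervals; duplicates are dropped because a repeated interval does not have a strictly greater left endpoint.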
Four Degrees of Separation, Really
We recently measured the average distance of users in the Facebook graph,
spurring comments in the scientific community as well as in the general press
("Four Degrees of Separation"). A number of interesting criticisms have been
made about the meaningfulness, methods and consequences of the experiment we
performed. In this paper we want to discuss some methodological aspects that we
deem important to underline in the form of answers to the questions we have
read in newspapers, magazines, blogs, or heard from colleagues. We indulge in
some reflections on the actual meaning of "average distance" and make a number
of side observations showing that, yes, 3.74 "degrees of separation" are really
few
Entity-Linking via Graph-Distance Minimization
Entity-linking is a natural-language-processing task that consists in
identifying the entities mentioned in a piece of text, linking each to an
appropriate item in some knowledge base; when the knowledge base is Wikipedia,
the problem is known as wikification (in this case, items are
Wikipedia articles). One instance of entity-linking can be formalized as an
optimization problem on the underlying concept graph, where the quantity to be
optimized is the average distance between chosen items. Inspired by this
application, we define a new graph problem which is a natural variant of the
Maximum Capacity Representative Set. We prove that our problem is NP-hard for
general graphs; nonetheless, under some restrictive assumptions, it turns out
to be solvable in linear time. For the general case, we propose two heuristics:
one tries to enforce the above assumptions and another one is based on the
notion of hitting distance; we show experimentally how these approaches perform
with respect to some baselines on a real-world dataset.
Comment: In Proceedings GRAPHITE 2014, arXiv:1407.7671. The second and third
authors were supported by the EU-FET grant NADINE (GA 288956
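The optimization view in the abstract above can be made concrete with a brute-force sketch. It is not either of the paper's heuristics: it simply tries every way of choosing one candidate item per mention and keeps the choice with the smallest sum of pairwise distances (equivalent to the smallest average, since the number of pairs is fixed). The function `best_linking`, the toy item names, and the distance table are all hypothetical.

```python
from itertools import combinations, product


def best_linking(candidates, dist):
    """Pick one item per mention, minimizing the sum of pairwise distances.

    candidates: list of candidate-item lists, one list per mention.
    dist: function (item_a, item_b) -> distance in the concept graph.
    Exhaustive search, exponential in the number of mentions.
    """
    best_choice, best_cost = None, float("inf")
    for choice in product(*candidates):
        cost = sum(dist(a, b) for a, b in combinations(choice, 2))
        if cost < best_cost:
            best_choice, best_cost = choice, cost
    return best_choice, best_cost


# Hypothetical toy data: symmetric distances between a few items.
toy_distances = {
    frozenset(("paris_city", "france")): 1,
    frozenset(("paris_city", "eiffel_tower")): 1,
    frozenset(("france", "eiffel_tower")): 2,
    frozenset(("paris_hilton", "france")): 5,
    frozenset(("paris_hilton", "eiffel_tower")): 6,
}


def toy_dist(a, b):
    return toy_distances[frozenset((a, b))]


toy_candidates = [["paris_city", "paris_hilton"], ["france"], ["eiffel_tower"]]
choice, cost = best_linking(toy_candidates, toy_dist)
```

The exhaustive search is only practical for a handful of mentions, which is consistent with the abstract's statement that the general problem is NP-hard.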
A Network Model characterized by a Latent Attribute Structure with Competition
The quest for a model that is able to explain, describe, analyze and simulate
real-world complex networks is of utmost practical as well as theoretical
interest. In this paper we introduce and study a network model that is based on
a latent attribute structure: each node is characterized by a number of
features and the probability of the existence of an edge between two nodes
depends on the features they share. Features are chosen according to a process
of Indian-Buffet type but with an additional random "fitness" parameter
attached to each node, that determines its ability to transmit its own features
to other nodes. As a consequence, a node's connectivity does not depend on its
age alone, so even "young" nodes are able to compete and succeed in acquiring
links. One of the advantages of our model for the latent bipartite
"node-attribute" network is that it depends on few parameters with a
straightforward interpretation. We provide some theoretical, as well as
experimental, results regarding the power-law behaviour of the model and the
estimation of the parameters. Using experimental data, we also show how the
proposed model for the attribute structure naturally captures most local and
global properties (e.g., degree distributions, connectivity and distance
distributions) real networks exhibit.
Keywords: complex network, social network, attribute matrix, Indian Buffet process.
Comment: 34 pages, second version (date of the first version: July, 2014).
Submitte
HyperANF: Approximating the Neighbourhood Function of Very Large Graphs on a Budget
The neighbourhood function N(t) of a graph G gives, for each t, the number of
pairs of nodes (x, y) such that y is reachable from x in less than t hops. The
neighbourhood function provides a wealth of information about the graph (e.g.,
it easily allows one to compute its diameter), but it is very expensive to
compute it exactly. Recently, the ANF algorithm (approximate neighbourhood
function) has been proposed with the purpose of approximating N(t) on large
graphs. We describe a breakthrough improvement over ANF in terms of speed and
scalability. Our algorithm, called HyperANF, uses the new HyperLogLog counters
and combines them efficiently through broadword programming; our implementation
uses overdecomposition to exploit multi-core parallelism. With HyperANF, for
the first time we can compute in a few hours the neighbourhood function of
graphs with billions of nodes with a small error and good confidence using a
standard workstation. Then, we turn to the study of the distribution of the
shortest paths between reachable nodes (that can be efficiently approximated by
means of HyperANF), and discover the surprising fact that its index of
dispersion provides a clear-cut characterisation of proper social networks vs.
web graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a
graph as a new, informative statistic that is able to discriminate between the
above two types of graphs. We believe this is the first proposal of a
significant new non-local structural index for complex networks whose
computation is highly scalable
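For small graphs, the quantities named in the abstract above can be computed exactly, which may help to fix the definitions. The sketch below is not HyperANF: it runs one breadth-first search per node instead of using HyperLogLog counters. It follows the abstract's wording ("less than t hops") for N(t), and it assumes that the spid is the variance-to-mean ratio of the distances between distinct reachable pairs; the function names and the adjacency-dictionary format are choices made for this example.

```python
from collections import deque


def distance_histogram(adj):
    """hist[d] = number of ordered pairs (x, y) with y at distance d from x.

    adj maps each node to an iterable of its successors. Pairs (x, x) are
    counted at distance 0. Exact, one BFS per node.
    """
    hist = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            hist[d] = hist.get(d, 0) + 1
    return hist


def neighbourhood_function(hist, t):
    """Number of pairs reachable in fewer than t hops (abstract's wording)."""
    return sum(count for d, count in hist.items() if d < t)


def spid(hist):
    """Variance-to-mean ratio of distances between distinct reachable pairs.

    Excluding distance-0 pairs is an assumption of this sketch.
    """
    pairs = {d: c for d, c in hist.items() if d > 0}
    total = sum(pairs.values())
    if total == 0:
        raise ValueError("no pair of distinct reachable nodes")
    mean = sum(d * c for d, c in pairs.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in pairs.items()) / total
    return variance / mean


# Undirected path 0 - 1 - 2, written as a symmetric adjacency dictionary.
path = {0: [1], 1: [0, 2], 2: [1]}
path_hist = distance_histogram(path)
```

The exact computation takes time proportional to the number of nodes times the number of arcs, which is the cost that makes approximation attractive on very large graphs.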