TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs
Given a large graph, how can we determine similarity between nodes in a fast
and accurate way? Random walk with restart (RWR) is a popular measure for this
purpose and has been exploited in numerous data mining applications including
ranking, anomaly detection, link prediction, and community detection. However,
previous methods for computing exact RWR require prohibitive storage sizes and
computational costs, and alternative methods which avoid such costs by
computing approximate RWR have limited accuracy. In this paper, we propose TPA,
a fast, scalable, and highly accurate method for computing approximate RWR on
large graphs. TPA exploits two important properties in RWR: 1) nodes close to a
seed node are likely to be revisited in following steps due to the block-wise
structure of many real-world graphs, and 2) RWR scores of nodes which reside
far from the seed node are proportional to their PageRank scores. Based on
these two properties, TPA divides the approximate RWR problem into two subproblems
called neighbor approximation and stranger approximation. In the neighbor
approximation, TPA estimates RWR scores of nodes close to the seed based on
scores of a few early steps from the seed. In the stranger approximation, TPA
estimates RWR scores for nodes far from the seed using their PageRank. The
stranger and neighbor approximations are conducted in the preprocessing phase
and the online phase, respectively. Through extensive experiments, we show that
TPA requires up to 3.5x less time with up to 40x less memory space than other
state-of-the-art methods for the preprocessing phase. In the online phase, TPA
computes approximate RWR up to 30x faster than existing methods while
maintaining high accuracy. Comment: 12 pages, 10 figures
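
For orientation, the exact RWR scores that TPA approximates can be computed on a small graph by plain power iteration. The sketch below is not TPA itself (it has no neighbor or stranger approximation); the function name `rwr`, the restart probability of 0.15, and the handling of dangling nodes are assumptions made for illustration.

```python
def rwr(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart by power iteration (illustrative sketch).

    adj maps each node to a list of its out-neighbors; every neighbor
    must itself be a key of adj. Returns a dict node -> RWR score.
    """
    nodes = list(adj)
    scores = {v: 0.0 for v in nodes}
    scores[seed] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                # Assumption: a walker stuck on a dangling node jumps to the seed.
                nxt[seed] += (1 - restart) * scores[u]
                continue
            share = (1 - restart) * scores[u] / len(out)
            for v in out:
                nxt[v] += share
        # With probability `restart` the walker returns to the seed.
        nxt[seed] += restart
        diff = sum(abs(nxt[v] - scores[v]) for v in nodes)
        scores = nxt
        if diff < tol:
            break
    return scores


# A four-node path graph a - b - c - d, with the walk seeded at a.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
path_scores = rwr(path, "a")
```

On this path graph the scores fall off with distance from the seed beyond its immediate neighbor, which is the locality the neighbor approximation relies on.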
Pseudorandom number generators revisited
Statistical Methods; mathematical statistics
Efficient Seeds Computation Revisited
The notion of the cover is a generalization of a period of a string, and
there are linear time algorithms for finding the shortest cover. The seed is a
more complicated generalization of periodicity: it is a cover of a superstring
of a given string, and the shortest seed problem is of much higher algorithmic
difficulty. The problem is not well understood, and no linear time algorithm is
known. In the paper we give linear time algorithms for some of its versions ---
computing shortest left-seed array, longest left-seed array and checking for
seeds of a given length. The algorithm for the last problem is used to compute
the seed array of a string (i.e., the shortest seeds for all the prefixes of
the string) in time. We also describe a simpler alternative algorithm
that efficiently computes the shortest seeds. As a by-product we obtain an
time algorithm checking if the shortest seed has length at
least and finding the corresponding seed. We also correct some important
details missing in the previously known shortest-seed algorithm (Iliopoulos et
al., 1996). Comment: 14 pages, accepted to CPM 201
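
To make the notion of a cover concrete: a string w covers s when every position of s lies inside some occurrence of w in s. The sketch below is a brute-force check, not one of the linear time algorithms mentioned in the abstract, and it does not handle seeds; the names `is_cover` and `shortest_cover` are chosen here for illustration.

```python
def is_cover(w, s):
    """Return True if every position of s lies inside an occurrence of w."""
    if not w or len(w) > len(s):
        return False
    covered_upto = 0  # s[:covered_upto] is covered so far
    start = s.find(w)
    while start != -1:
        if start > covered_upto:
            return False  # a gap that no occurrence of w reaches
        covered_upto = start + len(w)
        start = s.find(w, start + 1)  # allow overlapping occurrences
    return covered_upto == len(s)


def shortest_cover(s):
    """Brute force: a cover must be a prefix of s, so try prefixes by length."""
    for k in range(1, len(s) + 1):
        if is_cover(s[:k], s):
            return s[:k]
    return s  # only reached for the empty string


example_cover = shortest_cover("abaababaaba")
```

Here `example_cover` is "aba": its occurrences at positions 0, 3, 5, and 8 together cover the whole string, while "a" and "ab" leave gaps.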
Low-shot learning with large-scale diffusion
This paper considers the problem of inferring image labels from images when
only a few annotated examples are available at training time. This setup is
often referred to as low-shot learning, where a standard approach is to
re-train the last few layers of a convolutional neural network learned on
separate classes for which training examples are abundant. We consider a
semi-supervised setting based on a large collection of images to support label
propagation. This is possible by leveraging the recent advances on large-scale
similarity graph construction.
We show that despite its conceptual simplicity, scaling label propagation up
to hundreds of millions of images leads to state-of-the-art accuracy in the
low-shot learning regime.
Interactive Channel Capacity Revisited
We provide the first capacity approaching coding schemes that robustly
simulate any interactive protocol over an adversarial channel that corrupts any
fraction of the transmitted symbols. Our coding schemes achieve a
communication rate of over any
adversarial channel. This can be improved to for
random, oblivious, and computationally bounded channels, or if parties have
shared randomness unknown to the channel.
Surprisingly, these rates exceed the interactive channel capacity bound
which [Kol and Raz; STOC'13] recently proved for random errors. We conjecture
and to be the optimal rates for their respective settings
and therefore to capture the interactive channel capacity for random and
adversarial errors.
In addition to being very communication efficient, our randomized coding
schemes have multiple other advantages. They are computationally efficient,
extremely natural, and significantly simpler than prior (non-capacity
approaching) schemes. In particular, our protocols do not employ any coding but
allow the original protocol to be performed as-is, interspersed only by short
exchanges of hash values. When hash values do not match, the parties backtrack.
Our approach is, as we feel, by far the simplest and most natural explanation
for why and how robust interactive communication in a noisy environment is
possible.
Recursive Online Enumeration of All Minimal Unsatisfiable Subsets
In various areas of computer science, we deal with a set of constraints to be
satisfied. If the constraints cannot be satisfied simultaneously, it is
desirable to identify the core problems among them. Such cores are called
minimal unsatisfiable subsets (MUSes). The more MUSes are identified, the more
information about the conflicts among the constraints is obtained. However, a
full enumeration of all MUSes is in general intractable due to the large number
(even exponential) of possible conflicts. Moreover, to identify MUSes,
algorithms must test sets of constraints for their simultaneous satisfiability.
The type of test depends on the application domain. The complexity of
tests can be extremely high, especially for domains like temporal logics, model
checking, or SMT. In this paper, we propose a recursive algorithm that
identifies MUSes in an online manner (i.e., one by one) and can be terminated
at any time. The key feature of our algorithm is that it minimizes the number
of satisfiability tests and thus speeds up the computation. The algorithm is
applicable to an arbitrary constraint domain, and its effectiveness shows
especially in domains with expensive satisfiability checks. We benchmark
our algorithm against a state-of-the-art algorithm on Boolean and SMT constraint
domains and demonstrate that our algorithm requires fewer satisfiability
tests and consequently finds more MUSes in given time limits.
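
As background for what a single MUS is, the sketch below extracts one MUS from an unsatisfiable constraint list by the standard deletion-based shrinking procedure. It is not the recursive online enumeration algorithm proposed in the paper; the names `shrink_to_mus` and `is_sat`, and the toy domain (predicates over integers from -10 to 10, checked by brute force), are assumptions made for illustration.

```python
def shrink_to_mus(constraints, is_sat):
    """Deletion-based shrinking: return one minimal unsatisfiable subset.

    constraints is an unsatisfiable list; is_sat(list) -> bool is the
    domain-specific satisfiability test, called once per removal attempt.
    """
    if is_sat(constraints):
        raise ValueError("input constraints must be unsatisfiable")
    core = list(constraints)
    i = 0
    while i < len(core):
        candidate = core[:i] + core[i + 1:]
        if is_sat(candidate):
            i += 1            # core[i] is needed for the conflict; keep it
        else:
            core = candidate  # still unsatisfiable without it; drop it
    return core


# Toy domain (assumption): named predicates over one integer x in [-10, 10].
def is_sat(named_predicates):
    return any(all(p(x) for _, p in named_predicates) for x in range(-10, 11))


toy_constraints = [
    ("x > 5", lambda x: x > 5),
    ("x < 3", lambda x: x < 3),
    ("x > 0", lambda x: x > 0),
    ("x even", lambda x: x % 2 == 0),
]
mus_names = [name for name, _ in shrink_to_mus(toy_constraints, is_sat)]
```

In this toy instance the only conflict is between "x > 5" and "x < 3", so `mus_names` contains exactly those two constraints. Each loop iteration costs one satisfiability test, which is why the number of tests matters when each test is expensive.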