Stochastic Query Covering for Fast Approximate Document Retrieval
We design algorithms that, given a collection of documents and a distribution over user queries, return a
small subset of the document collection in such a way that we can efficiently provide high-quality answers
to user queries using only the selected subset. This approach has applications when space is a constraint
or when the query-processing time increases significantly with the size of the collection. We study our
algorithms through the lens of stochastic analysis and prove that even though they use only a small fraction
of the entire collection, they can provide answers to most user queries, achieving a performance close to the
optimal. To complement our theoretical findings, we experimentally show the versatility of our approach
by considering two important cases in the context of Web search. In the first case, we favor the retrieval of
documents that are relevant to the query, whereas in the second case we aim for document diversification.
Both the theoretical and the experimental analyses provide strong evidence of the potential value of query
covering in diverse application scenarios.
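The selection problem above can be illustrated with a simple greedy sketch: under a size budget, repeatedly pick the document that adds the most probability mass of newly answerable queries. This is a toy illustration with hypothetical data, not the paper's algorithms or guarantees.

```python
def greedy_query_cover(doc_answers, query_weight, budget):
    """Greedily pick up to `budget` documents, each time choosing the one
    that covers the largest probability mass of still-unanswered queries."""
    covered, chosen = set(), []
    for _ in range(budget):
        best_doc, best_gain = None, 0.0
        for doc, queries in doc_answers.items():
            if doc in chosen:
                continue
            gain = sum(query_weight[q] for q in queries - covered)
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:  # no remaining document adds coverage
            break
        chosen.append(best_doc)
        covered |= doc_answers[best_doc]
    return chosen, covered

# Hypothetical toy instance: which queries each document can answer,
# and the probability of each query in the workload.
doc_answers = {
    "d1": {"q1", "q2"},
    "d2": {"q2", "q3", "q4"},
    "d3": {"q4"},
    "d4": {"q5"},
}
query_weight = {"q1": 0.4, "q2": 0.2, "q3": 0.1, "q4": 0.1, "q5": 0.2}
chosen, covered = greedy_query_cover(doc_answers, query_weight, budget=2)
```

With a budget of two, the sketch keeps "d1" (mass 0.6) and then "d2", answering queries carrying 0.8 of the total probability mass with half the collection.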
QuPARA: Query-Driven Large-Scale Portfolio Aggregate Risk Analysis on MapReduce
Stochastic simulation techniques are used for portfolio risk analysis. Risk
portfolios may consist of thousands of reinsurance contracts covering millions
of insured locations. To quantify risk each portfolio must be evaluated in up
to a million simulation trials, each capturing a different possible sequence of
catastrophic events over the course of a contractual year. In this paper, we
explore the design of a flexible framework for portfolio risk analysis that
facilitates answering a rich variety of catastrophic risk queries. Rather than
aggregating simulation data in order to produce a small set of high-level risk
metrics efficiently (as is often done in production risk management systems),
the focus here is on allowing the user to pose queries on unaggregated or
partially aggregated data. The goal is to provide a flexible framework that can
be used by analysts to answer a wide variety of unanticipated but natural ad
hoc queries. Such detailed queries can help actuaries or underwriters to better
understand the multiple dimensions (e.g., spatial correlation, seasonality,
peril features, construction features, and financial terms) that can impact
portfolio risk. We implemented a prototype system, called QuPARA (Query-Driven
Large-Scale Portfolio Aggregate Risk Analysis), using Hadoop, which is Apache's
implementation of the MapReduce paradigm. This allows the user to take
advantage of large parallel compute servers in order to answer ad hoc risk
analysis queries efficiently even on very large data sets typically encountered
in practice. We describe the design and implementation of QuPARA and present
experimental results that demonstrate its feasibility. A full portfolio risk
analysis run consisting of a 1,000,000 trial simulation, with 1,000 events per
trial, and 3,200 risk transfer contracts can be completed on a 16-node Hadoop
cluster in just over 20 minutes.
Comment: 9 pages, IEEE International Conference on Big Data (BigData), Santa Clara, USA, 201
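The map/reduce flavor of such ad hoc queries over unaggregated simulation data can be sketched in plain Python. This is a toy stand-in, not QuPARA's Hadoop implementation; the record fields and the example predicate are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical unaggregated loss records:
# (trial_id, event_id, contract_id, peril, loss).
records = [
    (1, 10, "c1", "wind", 120.0),
    (1, 11, "c2", "quake", 80.0),
    (2, 12, "c1", "wind", 50.0),
    (2, 13, "c1", "flood", 200.0),
]

def run_query(records, predicate):
    """Map: filter raw loss records with an ad-hoc predicate and key them
    by trial. Shuffle/reduce: sum losses per trial."""
    mapped = [(trial, loss)
              for (trial, event, contract, peril, loss) in records
              if predicate(peril, contract)]
    per_trial = defaultdict(float)
    for trial, loss in mapped:
        per_trial[trial] += loss
    return dict(per_trial)

# Ad-hoc query: wind losses on contract c1 only, per simulation trial.
result = run_query(records,
                   lambda peril, contract: peril == "wind" and contract == "c1")
# result == {1: 120.0, 2: 50.0}
```

The point of the design is that the predicate is supplied at query time, so analysts can slice by peril, contract, or any other record dimension without a precomputed aggregate.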
Interactive Submodular Set Cover
We introduce a natural generalization of submodular set cover and exact
active learning with a finite hypothesis class (query learning). We call this
new problem interactive submodular set cover. Applications include advertising
in social networks with hidden information. We give an approximation guarantee
for a novel greedy algorithm and give a hardness of approximation result which
matches up to constant factors. We also discuss negative results for simpler
approaches and present encouraging early experimental results.
Comment: 15 pages, 1 figure
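For intuition, here is a minimal greedy sketch for the classical, non-interactive submodular set cover special case with a coverage objective; the paper's interactive setting and its greedy algorithm are strictly more general.

```python
def greedy_submodular_cover(named_sets, target):
    """Greedy for submodular set cover with f(S) = |union of chosen sets|:
    repeatedly add the set with the largest marginal coverage gain
    until f(S) >= target."""
    covered, picked = set(), []
    while len(covered) < target:
        name, gain = max(
            ((n, len(s - covered)) for n, s in named_sets.items()),
            key=lambda t: t[1])
        if gain == 0:
            raise ValueError("target coverage unreachable")
        picked.append(name)
        covered |= named_sets[name]
    return picked

# Hypothetical instance over a 6-element universe.
sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
picked = greedy_submodular_cover(sets, target=6)
```

Here the sketch covers all six elements with "A" and "C"; the interactive variant additionally interleaves queries whose answers reshape the objective as the algorithm proceeds.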
Stochastic Analysis of a Churn-Tolerant Structured Peer-to-Peer Scheme
We present and analyze a simple and general scheme to build a churn
(fault)-tolerant structured Peer-to-Peer (P2P) network. Our scheme shows how to
"convert" a static network into a dynamic distributed hash table(DHT)-based P2P
network such that all the good properties of the static network are guaranteed
with high probability (w.h.p.). Applying our scheme to a cube-connected cycles
network, for example, yields an O(log N)-degree connected network, in which
every search succeeds in O(log N) hops w.h.p., using O(log N) messages,
where N is the expected stable network size. Our scheme has a constant
storage overhead (the number of nodes responsible for servicing a data item)
and an O(log N) overhead (messages and time) per insertion and essentially
no overhead for deletions. All these bounds are essentially optimal. While DHT
schemes with similar guarantees are already known in the literature, this work
is new in the following aspects:
(1) It presents a rigorous mathematical analysis of the scheme under a
general stochastic model of churn and shows the above guarantees;
(2) The theoretical analysis is complemented by a simulation-based analysis
that validates the asymptotic bounds even in moderately sized networks and also
studies performance under changing stable network size;
(3) The presented scheme seems especially suitable for maintaining dynamic
structures under churn efficiently. In particular, we show that a spanning tree
of low diameter can be efficiently maintained in constant time and logarithmic
number of messages per insertion or deletion w.h.p.
Keywords: P2P Network, DHT Scheme, Churn, Dynamic Spanning Tree, Stochastic
Analysis
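The key-to-node assignment underlying DHT-based schemes of this kind can be sketched with consistent hashing. This is an illustrative sketch of the general DHT idea, not the paper's construction; the peer names and ring size are assumptions.

```python
import hashlib

def ring_hash(key, m=16):
    """Map a string to a point on a 2^m-slot hash ring."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % (2 ** m)

def successor(nodes, point):
    """Return the ring position of the node clockwise-closest to `point`;
    that node is responsible for keys hashing to `point`."""
    positions = sorted(ring_hash(n) for n in nodes)
    for p in positions:
        if p >= point:
            return p
    return positions[0]  # wrap around the ring

nodes = ["peer-a", "peer-b", "peer-c", "peer-d"]
item_pos = ring_hash("some-data-item")
responsible = successor(nodes, item_pos)
```

Under churn, only the keys between a departing node and its predecessor move, which is why such schemes can keep per-insertion overhead logarithmic and deletion overhead essentially free.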
Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover
Stochastic Boolean Function Evaluation is the problem of determining the
value of a given Boolean function f on an unknown input x, when each bit x_i
of x can only be determined by paying an associated cost c_i. The assumption is
that x is drawn from a given product distribution, and the goal is to minimize
the expected cost. This problem has been studied in Operations Research, where
it is known as "sequential testing" of Boolean functions. It has also been
studied in learning theory in the context of learning with attribute costs. We
consider the general problem of developing approximation algorithms for
Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for
evaluating Boolean linear threshold formulas. We also present an approximation
algorithm for evaluating CDNF formulas (and decision trees) achieving a factor
of O(log kd), where k is the number of terms in the DNF formula, and d is the
number of clauses in the CNF formula. In addition, we present approximation
algorithms for simultaneous evaluation of linear threshold functions, and for
ranking of linear functions.
Our function evaluation algorithms are based on reductions to the Stochastic
Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and
Krause. They presented an approximation algorithm for the problem, called
Adaptive Greedy. Our main technical contribution is a new approximation
algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an
extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito,
which is a generalization of Hochbaum's algorithm for the classical Set Cover
Problem. We also give a new bound on the approximation achieved by the Adaptive
Greedy algorithm of Golovin and Krause.
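As a concrete special case of sequential testing, evaluating an OR function under a product distribution admits a simple optimal strategy: read bits in nonincreasing order of p_i/c_i and stop as soon as the value is determined. This toy sketch illustrates the cost model only; the paper's algorithms for threshold and CDNF formulas are substantially more involved.

```python
def evaluate_or(costs, probs, sample):
    """Sequential testing of f = x_1 OR ... OR x_n: read bits in
    decreasing p_i/c_i order, paying c_i per read, and stop as soon
    as a 1 is seen (f is then determined)."""
    order = sorted(range(len(costs)),
                   key=lambda i: probs[i] / costs[i], reverse=True)
    spent = 0.0
    for i in order:
        spent += costs[i]
        if sample[i] == 1:
            return 1, spent   # OR determined: true
    return 0, spent           # read everything: all bits were 0

# Hypothetical instance: per-bit costs and probabilities P(x_i = 1).
costs = [1.0, 3.0, 2.0]
probs = [0.2, 0.9, 0.5]
value, cost = evaluate_or(costs, probs, sample=[0, 1, 0])
```

On this sample the bit with the best ratio (index 1) is read first and turns out to be 1, so evaluation stops after a single read of cost 3.0.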
Knowledge Spaces and Learning Spaces
How to design automated procedures which (i) accurately assess the knowledge
of a student, and (ii) efficiently provide advice for further study? To
produce well-founded answers, Knowledge Space Theory relies on a combinatorial
viewpoint on the assessment of knowledge, and thus departs from common,
numerical evaluation. Its assessment procedures fundamentally differ from other
current ones (such as those of S.A.T. and A.C.T.). They are adaptive (taking
into account the possible correctness of previous answers from the student) and
they produce an outcome which is far more informative than a crude numerical
mark. This chapter recapitulates the main concepts underlying Knowledge Space
Theory and its special case, Learning Space Theory. We begin by describing the
combinatorial core of the theory, in the form of two basic axioms and the main
ensuing results (most of which we give without proofs). In practical
applications, learning spaces are huge combinatorial structures which may be
difficult to manage. We outline methods providing efficient and comprehensive
summaries of such large structures. We then describe the probabilistic part of
the theory, especially the Markovian type processes which are instrumental in
uncovering the knowledge states of individuals. In the guise of the ALEKS
system, which includes a teaching component, these methods have been used by
millions of students in schools and colleges, and by home schooled students. We
summarize some of the results of these applications.
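The defining combinatorial property of a knowledge space, that the family of knowledge states is closed under union, is easy to check directly. A toy sketch over a hypothetical three-item domain:

```python
from itertools import combinations

def is_union_closed(states):
    """Check the knowledge-space axiom: the union of any two
    knowledge states is again a knowledge state."""
    family = {frozenset(s) for s in states}
    return all(a | b in family for a, b in combinations(family, 2))

# Hypothetical domain of three items {a, b, c} with five states.
states = [set(), {"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}]
ok = is_union_closed(states)  # True: every pairwise union is a state
```

Dropping the state {"a", "b"} from the family would break closure, since the union of {"a"} and {"b"} would no longer be a state; real learning spaces impose further axioms on how states can be reached one item at a time.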