A Survey and Taxonomy of Graph Sampling
Graph sampling is a technique to pick a subset of vertices and/or edges from
the original graph. It has a wide spectrum of applications, e.g. surveying hidden
populations in sociology [54], visualizing social graphs [29], scaling down the
Internet AS graph [27], graph sparsification [8], etc. In some scenarios, the whole
graph is known and the purpose of sampling is to obtain a smaller graph. In
other scenarios, the graph is unknown and sampling is regarded as a way to
explore the graph. Commonly used techniques are Vertex Sampling, Edge Sampling
and Traversal Based Sampling. We provide a taxonomy of different graph sampling
objectives and graph sampling approaches. The relations between these
approaches are formally argued and a general framework to bridge theoretical
analysis and practical implementation is provided. Although smaller in
size, a sampled graph may remain similar to the original graph in certain respects. We are
particularly interested in what graph properties are preserved given a sampling
procedure. If some properties are preserved, we can estimate them on the
sampled graphs, which gives a way to construct efficient estimators. If an
algorithm relies on the preserved properties, we can expect it to give
similar outputs on the original and sampled graphs. This leads to a systematic way
to accelerate a class of graph algorithms. In this survey, we discuss both
classical textbook properties and some advanced properties. The landscape
is tabulated, revealing many gaps in this field. Some
theoretical studies are collected in this survey and simple extensions are
made. Most previous numerical evaluations have been ad hoc, i.e., each
evaluates different types of graphs, different sets of properties, and different
sampling algorithms. A systematic and neutral evaluation is needed to shed
light on further graph sampling studies.
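The two simplest families the abstract names, Vertex Sampling and Edge Sampling, can be sketched as follows. This is an illustrative toy (function names, the example graph, and parameters are invented for this sketch), not code from the survey:

```python
import random

def vertex_sampling(edges, vertices, k, rng):
    # Vertex Sampling: pick k vertices uniformly at random,
    # then keep only the edges induced between sampled vertices.
    kept = set(rng.sample(sorted(vertices), k))
    return kept, [(u, v) for (u, v) in edges if u in kept and v in kept]

def edge_sampling(edges, k, rng):
    # Edge Sampling: pick k edges uniformly at random,
    # then keep the endpoints of the sampled edges.
    sampled = rng.sample(edges, k)
    kept = {v for e in sampled for v in e}
    return kept, sampled

# A small example graph: a 4-cycle plus one chord.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
vertices = {0, 1, 2, 3}
rng = random.Random(42)

vs_nodes, vs_edges = vertex_sampling(edges, vertices, 3, rng)
es_nodes, es_edges = edge_sampling(edges, 2, rng)
```

Traversal Based Sampling differs in that it explores an unknown graph by following edges (e.g. random walks) rather than drawing uniformly from a known vertex or edge set.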
A traffic classification method using machine learning algorithm
Applying concepts of attack investigation in the IT industry, this work designs
a traffic classification method that combines data mining techniques with
machine learning algorithms to classify traffic as normal or malicious. This
classification helps in learning about the unknown attacks faced by the IT
industry. Traffic classification is not a new concept; plenty of work has been
done to classify network traffic for heterogeneous applications. Existing
techniques (payload-based, port-based, and statistical-based) have their own
pros and cons, which are discussed later in this work, but classification using
machine learning techniques is still an open field to explore and has provided
very promising results so far.
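As a minimal sketch of the statistical-based approach mentioned above, one can classify flows by their summary statistics with a nearest-centroid rule. The features, numbers, and labels here are hypothetical, invented purely to illustrate the idea:

```python
def centroid(flows):
    # Mean of each statistical feature across a set of training flows.
    n = len(flows)
    return tuple(sum(f[i] for f in flows) / n for i in range(len(flows[0])))

def classify(flow, centroids):
    # Nearest-centroid rule on flow statistics: assign the label whose
    # centroid is closest in squared Euclidean distance.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(flow, centroids[label]))

# Hypothetical training flows: (mean packet size in bytes, packets per second).
normal = [(520.0, 12.0), (480.0, 10.0), (550.0, 14.0)]
malicious = [(90.0, 800.0), (110.0, 950.0), (100.0, 870.0)]
centroids = {"normal": centroid(normal), "malicious": centroid(malicious)}

# A new flow with small packets at a high rate resembles the malicious class.
label = classify((105.0, 900.0), centroids)
```

Real systems would use richer features and a trained model (e.g. decision trees or SVMs), but the pattern, statistics in, label out, is the same.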
Generalized Approximate Survey Propagation for High-Dimensional Estimation
In Generalized Linear Estimation (GLE) problems, we seek to estimate a signal
that is observed through a linear transform followed by a component-wise,
possibly nonlinear and noisy, channel. In the Bayesian optimal setting,
Generalized Approximate Message Passing (GAMP) is known to achieve optimal
performance for GLE. However, its performance can significantly degrade
whenever there is a mismatch between the assumed and the true generative model,
a situation frequently encountered in practice. In this paper, we propose a new
algorithm, named Generalized Approximate Survey Propagation (GASP), for solving
GLE in the presence of prior or model mis-specifications. As a prototypical
example, we consider the phase retrieval problem, where we show that GASP
outperforms the corresponding GAMP, reducing the reconstruction threshold and,
for certain choices of its parameters, approaching Bayesian optimal
performance. Furthermore, we present a set of State Evolution equations that
exactly characterize the dynamics of GASP in the high-dimensional limit.
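To make the GLE setup concrete: the signal x is observed as y = φ(Ax + noise), where φ is the component-wise channel; phase retrieval corresponds to φ(z) = |z|. The sketch below only generates such an instance (it does not implement GASP itself), and all names and parameter choices are illustrative:

```python
import random

def gle_observations(A, x, channel, noise_std, rng):
    # Generic GLE measurement: y_mu = channel((A x)_mu + w_mu),
    # with Gaussian noise w_mu of standard deviation noise_std.
    y = []
    for row in A:
        z = sum(a * xi for a, xi in zip(row, x)) + rng.gauss(0.0, noise_std)
        y.append(channel(z))
    return y

rng = random.Random(0)
n, m = 4, 8
x = [rng.gauss(0.0, 1.0) for _ in range(n)]            # hidden signal
A = [[rng.gauss(0.0, 1.0 / n ** 0.5) for _ in range(n)]
     for _ in range(m)]                                # random linear transform

# Phase retrieval channel: only the magnitude of each measurement is observed.
y = gle_observations(A, x, abs, 0.01, rng)
```

Recovering x from such magnitude-only observations is exactly the regime where the paper reports GASP outperforming GAMP under model mismatch.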
Estimating Discrete Markov Models From Various Incomplete Data Schemes
The parameters of a discrete stationary Markov model are transition
probabilities between states. Traditionally, data consist of sequences of
observed states for a given number of individuals over the whole observation
period. In such a case, the estimation of transition probabilities is
straightforwardly made by counting one-step moves from a given state to
another. In many real-life problems, however, the inference is much more
difficult as state sequences are not fully observed, namely the state of each
individual is known only for some given values of the time variable. A review
of the problem is given, focusing on Monte Carlo Markov Chain (MCMC) algorithms
to perform Bayesian inference and evaluate posterior distributions of the
transition probabilities in this missing-data framework. Leaning on the
dependence between the rows of the transition matrix, an adaptive MCMC
mechanism accelerating the classical Metropolis-Hastings algorithm is then
proposed and empirically studied. Comment: 26 pages; preprint accepted on 20th February 2012 for publication in
Computational Statistics and Data Analysis (please cite the journal's paper).
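The fully observed case described above, estimating transition probabilities by counting one-step moves, can be sketched directly (the function name and example sequences are illustrative, not from the paper):

```python
from collections import Counter

def estimate_transitions(sequences, states):
    # Count one-step moves s -> t across all fully observed sequences,
    # then normalise each row to obtain transition probabilities.
    counts = Counter()
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[(s, t)] += 1
    P = {}
    for s in states:
        row_total = sum(counts[(s, t)] for t in states)
        P[s] = {t: counts[(s, t)] / row_total if row_total else 0.0
                for t in states}
    return P

# Three observed state sequences over the two-state space {A, B}.
sequences = ["AABAB", "BBAA", "ABBB"]
P = estimate_transitions(sequences, "AB")
```

When the sequences are only partially observed, this simple counting is no longer available; that is the missing-data regime where the paper turns to MCMC for Bayesian inference.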
Simulation optimization: A review of algorithms and applications
Simulation Optimization (SO) refers to the optimization of an objective
function subject to constraints, both of which can be evaluated through a
stochastic simulation. To address specific features of a particular
simulation---discrete or continuous decisions, expensive or cheap simulations,
single or multiple outputs, homogeneous or heterogeneous noise---various
algorithms have been proposed in the literature. As one can imagine, there
exist several competing algorithms for each of these classes of problems. This
document emphasizes the difficulties in simulation optimization as compared to
mathematical programming, makes reference to state-of-the-art algorithms in the
field, examines and contrasts the different approaches used, reviews some of
the diverse applications that have been tackled by these methods, and
speculates on future directions in the field.
A survey of discrete methods in (algebraic) statistics for networks
Sampling algorithms, hypergraph degree sequences, and polytopes play a
crucial role in statistical analysis of network data. This article offers a
brief overview of open problems in this area of discrete mathematics from the
point of view of a particular family of statistical models for networks called
exponential random graph models. The problems and underlying constructions are
also related to well-known concepts in commutative algebra and graph-theoretic
concepts in computer science. We outline a few lines of recent work that
highlight the natural connection between these fields and unify them into some
open problems. While these problems are often relevant in discrete mathematics
in their own right, the emphasis here is on statistical relevance with the hope
that these lines of research do not remain disjoint. Suggested specific open
problems and general research questions should advance algebraic statistics
theory as well as applied statistical tools for rigorous statistical analysis
of networks. Comment: Revised for clarity, minor updates, added example, upon suggestions
of people mentioned in the acknowledgements section.
Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
An automated technique has recently been proposed to transfer learning in the
hierarchical Bayesian optimization algorithm (hBOA) based on distance-based
statistics. The technique enables practitioners to improve hBOA efficiency by
collecting statistics from probabilistic models obtained in previous hBOA runs
and using the obtained statistics to bias future hBOA runs on similar problems.
The purpose of this paper is threefold: (1) test the technique on several
classes of NP-complete problems, including MAXSAT, spin glasses and minimum
vertex cover; (2) demonstrate that the technique is effective even when
previous runs were done on problems of different size; (3) provide empirical
evidence that combining transfer learning with other efficiency enhancement
techniques can often yield nearly multiplicative speedups. Comment: Accepted at Parallel Problem Solving from Nature (PPSN XII), 10
pages. arXiv admin note: substantial text overlap with arXiv:1201.224
Mixture Models and Networks -- Overview of Stochastic Blockmodelling
Mixture models are probabilistic models aimed at uncovering and representing
latent subgroups within a population. In the realm of network data analysis,
the latent subgroups of nodes are typically identified by their connectivity
behaviour, with nodes behaving similarly belonging to the same community. In
this context, mixture modelling is pursued through stochastic blockmodelling.
We consider stochastic blockmodels and some of their variants and extensions
from a mixture modelling perspective. We also survey some of the main classes
of estimation methods available, and propose an alternative approach. In
addition to the discussion of inferential properties and estimating procedures,
we focus on the application of the models to several real-world network
datasets, showcasing the advantages and pitfalls of different approaches. Comment: 23 pages, 5 figures.
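The generative side of a stochastic blockmodel is simple to state: each node belongs to a latent block, and an edge between two nodes appears independently with a probability that depends only on their blocks. A minimal sketch, with invented block sizes and connectivity matrix:

```python
import random

def sample_sbm(block_sizes, B, rng):
    # Stochastic blockmodel: node i lies in block b(i); the undirected edge
    # (i, j) appears independently with probability B[b(i)][b(j)].
    blocks = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(blocks)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < B[blocks[i]][blocks[j]]:
                edges.add((i, j))
    return blocks, edges

rng = random.Random(1)
# Two communities of 5 nodes: dense within (0.8), sparse between (0.05).
blocks, edges = sample_sbm([5, 5], [[0.8, 0.05], [0.05, 0.8]], rng)
```

Inference inverts this process: given only the edges, recover the block memberships and the connectivity matrix, which is where the estimation methods surveyed in the paper come in.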
Thompson Sampling for Dynamic Pricing
In this paper we apply active learning algorithms for dynamic pricing in a
prominent e-commerce website. Dynamic pricing involves changing the price of
items on a regular basis, and uses the feedback from the pricing decisions to
update prices of the items. Most popular approaches to dynamic pricing use a
passive learning approach, where the algorithm uses historical data to learn
various parameters of the pricing problem, and uses the updated parameters to
generate a new set of prices. We show that one can use active learning
algorithms such as Thompson sampling to more efficiently learn the underlying
parameters in a pricing problem. We apply our algorithms to a real e-commerce
system and show that the algorithms indeed improve revenue compared to pricing
algorithms that use passive learning.
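A minimal sketch of Thompson sampling for pricing, assuming a Beta-Bernoulli model of conversion at each candidate price: sample a conversion rate from each price's posterior, charge the price with the highest sampled expected revenue, and update the posterior with the observed sale. The prices and demand rates below are hypothetical, not from the paper's e-commerce system:

```python
import random

def thompson_price(prices, successes, failures, rng):
    # Sample a conversion rate for each price from its Beta(s+1, f+1)
    # posterior and pick the price maximising sampled rate * price.
    def sampled_revenue(p):
        theta = rng.betavariate(successes[p] + 1, failures[p] + 1)
        return theta * p
    return max(prices, key=sampled_revenue)

rng = random.Random(7)
prices = [10.0, 12.0, 15.0]
true_rate = {10.0: 0.5, 12.0: 0.55, 15.0: 0.2}   # hypothetical demand curve

successes = {p: 0 for p in prices}
failures = {p: 0 for p in prices}
revenue = 0.0
for _ in range(2000):
    p = thompson_price(prices, successes, failures, rng)
    sold = rng.random() < true_rate[p]           # simulated customer decision
    successes[p] += sold
    failures[p] += not sold
    revenue += p * sold
```

Because prices are chosen by posterior sampling rather than from a fixed historical fit, the algorithm keeps exploring uncertain prices while exploiting the apparent best one, which is the active-learning advantage the paper reports.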