Vertex nomination schemes for membership prediction
Suppose that a graph is realized from a stochastic block model where one of
the blocks is of interest, but many or all of the vertices' block labels are
unobserved. The task is to order the vertices with unobserved block labels into
a "nomination list" such that, with high probability, vertices from the
interesting block are concentrated near the list's beginning. We propose
several vertex nomination schemes. Our basic but principled setting and
development yield a best nomination scheme (which is a Bayes-optimal
analogue), and also a likelihood maximization nomination scheme that is
practical to implement when there are a thousand vertices, and which is
empirically near-optimal when the number of vertices is small enough to allow
comparison to the best nomination scheme. We then illustrate the robustness of
the likelihood maximization nomination scheme to the modeling challenges
inherent in real data, using examples which include a social network involving
human trafficking, the Enron Graph, a worm brain connectome and a political
blog network.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS834 in the Annals of
  Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
  Mathematical Statistics (http://www.imstat.org
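The nomination task described in this abstract can be illustrated with a toy sketch. It samples a small stochastic block model and orders unlabeled vertices by how many edges they share with vertices already known to be in the interesting block. This counting rule is a naive heuristic assumed here for illustration only; it is not one of the nomination schemes proposed in the paper, and the function names and parameters are hypothetical.

```python
import random


def sample_sbm(block_of, edge_prob, rng):
    """Sample an undirected graph from a stochastic block model.

    block_of[i] is the block label of vertex i; edge_prob[a][b] is the
    probability of an edge between a vertex in block a and one in block b.
    Returns a list of neighbor sets.
    """
    n = len(block_of)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < edge_prob[block_of[i]][block_of[j]]:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def nominate(adj, known_interesting, candidates):
    """Order candidates so that those adjacent to more known
    interesting vertices come first (naive heuristic, assumed here)."""
    seeds = set(known_interesting)
    return sorted(candidates, key=lambda v: -len(adj[v] & seeds))


if __name__ == "__main__":
    rng = random.Random(0)
    # Block 0 is the "interesting" block and is densely connected.
    block_of = [0] * 10 + [1] * 30
    edge_prob = [[0.6, 0.1], [0.1, 0.1]]
    adj = sample_sbm(block_of, edge_prob, rng)
    # Vertices 0-2 have observed labels; the rest are to be nominated.
    print(nominate(adj, [0, 1, 2], list(range(3, 40)))[:10])
```

When the interesting block is denser than the rest of the graph, its members tend to appear near the front of the returned list, which is the property a nomination list is meant to have.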
The Discrete Infinite Logistic Normal Distribution
We present the discrete infinite logistic normal distribution (DILN), a
Bayesian nonparametric prior for mixed membership models. DILN is a
generalization of the hierarchical Dirichlet process (HDP) that models
correlation structure between the weights of the atoms at the group level. We
derive a representation of DILN as a normalized collection of gamma-distributed
random variables, and study its statistical properties. We consider
applications to topic modeling and derive a variational inference algorithm for
approximate posterior inference. We study the empirical performance of the DILN
topic model on four corpora, comparing performance with the HDP and the
correlated topic model (CTM). To deal with large-scale data sets, we also
develop an online inference algorithm for DILN and compare with online HDP and
online LDA on a corpus from Nature magazine, which contains approximately 350,000
articles.
Comment: This paper will appear in Bayesian Analysis. A shorter version of
  this paper appeared at AISTATS 2011, Fort Lauderdale, FL, US
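The idea of building mixture weights from normalized gamma-distributed random variables, mentioned in this abstract, can be sketched as follows. This is a minimal illustration under stated assumptions: a finite truncation to a fixed number of atoms, caller-supplied shape parameters, and a per-atom log-scale offset whose exponential is used as the gamma scale. The function name, its parameters, and this particular parameterization are hypothetical and are not taken from the paper. With all offsets equal to zero, the draw is a standard Dirichlet sample.

```python
import math
import random


def normalized_gamma_weights(shapes, log_scales, rng):
    """Draw one gamma variable per atom and normalize to sum to one.

    shapes[i] is the gamma shape for atom i; exp(log_scales[i]) is its
    scale (an assumed parameterization, for illustration only).
    """
    draws = [
        rng.gammavariate(shape, math.exp(offset))
        for shape, offset in zip(shapes, log_scales)
    ]
    total = sum(draws)
    return [z / total for z in draws]


if __name__ == "__main__":
    rng = random.Random(0)
    # Three atoms; the second atom's scale is inflated, the third deflated.
    print(normalized_gamma_weights([2.0, 1.0, 0.5], [0.0, 0.3, -0.3], rng))
```

If the offsets for different atoms are drawn jointly with correlation, the resulting weights become correlated across atoms, which is the kind of structure the abstract says the HDP does not model.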
Bayesian Restricted Likelihood Methods: Conditioning on Insufficient Statistics in Bayesian Regression
Bayesian methods have proven themselves to be successful across a wide range
of scientific problems and have many well-documented advantages over competing
methods. However, these methods run into difficulties for two major and
prevalent classes of problems: handling data sets with outliers and dealing
with model misspecification. We outline the drawbacks of previous solutions to
both of these problems and propose a new method as an alternative. When working
with the new method, the data is summarized through a set of insufficient
statistics, targeting inferential quantities of interest, and the prior
distribution is updated with the summary statistics rather than the complete
data. By careful choice of conditioning statistics, we retain the main benefits
of Bayesian methods while reducing the sensitivity of the analysis to features
of the data not captured by the conditioning statistics. For reducing
sensitivity to outliers, classical robust estimators (e.g., M-estimators) are
natural choices for conditioning statistics. A major contribution of this work
is the development of a data augmented Markov chain Monte Carlo (MCMC)
algorithm for the linear model and a large class of summary statistics. We
demonstrate the method on simulated and real data sets containing outliers and
subject to model misspecification. Success is manifested in better predictive
performance for data points of interest as compared to competing methods.
Influence networks
Some behaviors, ideas or technologies spread and become persistent in society, whereas others vanish. This paper analyzes the role of social influence in determining such distinct collective outcomes. Agents are assumed to acquire information from others through a certain sampling process that generates an influence network, and they use simple rules to decide whether to adopt or not depending on the observed sample. We characterize, as a function of the primitives of the model, the diffusion threshold (i.e., the spreading rate above which the adoption of the new behavior becomes persistent in the population) and the endemic state (i.e., the fraction of adopters in the stationary state of the dynamics). We find that the new behavior will easily spread in the population if there is a high correlation between how influential (visible) and how easily influenced an agent is, which is determined by the sampling process and the adoption rule. We also analyze how the density and variance of the out-degree distribution affect the diffusion threshold and the endemic state.
Keywords: social influence, networks, diffusion threshold, endemic state
