From Relational Data to Graphs: Inferring Significant Links using Generalized Hypergeometric Ensembles
The inference of network topologies from relational data is an important
problem in data analysis. Exemplary applications include the reconstruction of
social ties from data on human interactions, the inference of gene
co-expression networks from DNA microarray data, or the learning of semantic
relationships based on co-occurrences of words in documents. Solving these
problems requires techniques to infer significant links in noisy relational
data. In this short paper, we propose a new statistical modeling framework to
address this challenge. It builds on generalized hypergeometric ensembles, a
class of generative stochastic models that give rise to analytically tractable
probability spaces of directed, multi-edge graphs. We show how this framework
can be used to assess the significance of links in noisy relational data. We
illustrate our method on two data sets capturing spatio-temporal proximity
relations between actors in a social system. The results show that our
analytical framework provides a new approach to infer significant links from
relational data, with interesting perspectives for the mining of data on social
systems.
Comment: 10 pages, 8 figures, accepted at SocInfo201
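The significance test described in the abstract can be illustrated with a plain (non-generalized) hypergeometric null model; the function and the toy numbers below are our own illustration, not the paper's generalized ensemble:

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) for a hypergeometric draw: N multi-edges sampled without
    replacement from M possible edge slots, of which n belong to the node
    pair of interest."""
    denom = comb(M, N)
    return sum(comb(n, x) * comb(M - n, N - x)
               for x in range(k, min(n, N) + 1)) / denom

# Toy check: a node pair holds 10 of 1000 possible edge slots and receives
# 8 of the 200 observed multi-edges -- far more than the ~2 expected.
p = hypergeom_sf(8, 1000, 10, 200)
print(p < 0.01)  # the link would be flagged as significant at the 1% level
```

A link is then kept in the inferred graph only when this tail probability falls below a chosen significance threshold.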
An Accuracy-Assured Privacy-Preserving Recommender System for Internet Commerce
Recommender systems, tools that predict users' potential preferences from historical data and users' interests, are of increasing importance in various Internet applications such as online shopping. As a well-known recommendation method, neighbourhood-based collaborative filtering has attracted considerable attention recently, and the risk of revealing users' private information during the filtering process has drawn notable research interest. Among current solutions, probabilistic techniques have shown a strong privacy-preserving effect; however, when facing a Nearest Neighbour (NN) attack, all existing methods provide no data-utility guarantee, because they introduce global randomness. In this paper, to overcome the problem of recommendation accuracy loss, we propose a novel approach, Partitioned Probabilistic Neighbour Selection, which ensures a required prediction accuracy while maintaining high security against the NN attack. We define the sum of the neighbours' similarities as the accuracy metric alpha, and the number of user partitions across which we select the neighbours as the security metric beta. We generalise the Nearest Neighbour attack to a beta k Nearest Neighbours attack. Unlike the existing approach, which selects neighbours randomly across the entire candidate list, our method selects neighbours from each exclusive partition with a decreasing probability. Theoretical and experimental analysis shows that, to provide an accuracy-assured recommendation, our Partitioned Probabilistic Neighbour Selection method yields a better trade-off between recommendation accuracy and system security.
Comment: replacement for the previous version
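The partition-then-sample idea can be sketched as follows; the decay rule, the parameter values, and the fill-in-rank-order policy are our own assumptions for illustration, not the paper's exact scheme:

```python
import random

def partitioned_neighbour_selection(candidates, k, beta, decay=0.5, seed=0):
    """Illustrative sketch: sort candidates by similarity, split them into
    `beta` exclusive contiguous partitions, then fill the k-neighbour set
    partition by partition, keeping each candidate with a probability that
    decays for later (less similar) partitions."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)  # (user, sim)
    size = max(1, len(ranked) // beta)
    partitions = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    chosen, p = [], 1.0
    for part in partitions:
        for user, sim in part:
            if len(chosen) < k and rng.random() < p:
                chosen.append((user, sim))
        p *= decay  # later partitions contribute with lower probability
    return chosen

cands = [("u%d" % i, 1.0 - 0.01 * i) for i in range(40)]
picked = partitioned_neighbour_selection(cands, k=5, beta=4)
```

Restricting the random selection to ranked partitions is what bounds the loss in the similarity-sum metric alpha, while the spread across partitions supplies the randomness measured by beta.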
Uses of the Hypergeometric Distribution for Determining Survival or Complete Representation of Subpopulations in Sequential Sampling
This thesis explores the hypergeometric probability distribution from many different angles. These include, but are not limited to: its history and origin; its derivation and elementary applications; its properties; its relationships to other probability models; kindred hypergeometric distributions; and elements of statistical inference associated with the hypergeometric distribution. Once the above are established, we investigate and extend work done by Walton (1986) and Charalambides (2005). Here, we apply the hypergeometric distribution to sequential sampling in order to determine a surviving subcategory, and we study the problem of complete representation of the subcategories within the population.
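The complete-representation question has a closed form by inclusion-exclusion over the subcategories that a sample might miss; the small routine below (our own illustration of the standard identity, not the thesis's notation) computes it exactly:

```python
from math import comb
from itertools import combinations

def prob_complete_representation(sizes, N):
    """P(a simple random sample of N items drawn without replacement from an
    urn of sum(sizes) items contains at least one member of every
    subcategory), via inclusion-exclusion over the missed subcategories."""
    M = sum(sizes)
    total = 0.0
    for r in range(len(sizes) + 1):
        for S in combinations(sizes, r):
            missing = sum(S)
            if M - missing >= N:  # otherwise the sample cannot avoid S
                total += (-1) ** r * comb(M - missing, N) / comb(M, N)
    return total

# Urn with subcategories of sizes 5, 3 and 2; sample 4 of the 10 items:
p = prob_complete_representation([5, 3, 2], 4)
```

Each term is the hypergeometric probability that the sample avoids a particular set of subcategories, so the whole computation stays inside the distribution the thesis studies.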
First-Come-First-Served for Online Slot Allocation and Huffman Coding
Can one choose a good Huffman code on the fly, without knowing the underlying
distribution? Online Slot Allocation (OSA) models this and similar problems:
There are n slots, each with a known cost. There are n items. Requests for
items are drawn i.i.d. from a fixed but hidden probability distribution p.
After each request, if the item, i, was not previously requested, then the
algorithm (knowing the slot costs and the requests so far, but not p) must
place the item in some vacant slot j(i). The goal is to minimize the sum, over
the items, of the probability of the item times the cost of its assigned slot.
The optimal offline algorithm is trivial: put the most probable item in the
cheapest slot, the second most probable item in the second cheapest slot, etc.
The optimal online algorithm is First Come First Served (FCFS): put the first
requested item in the cheapest slot, the second (distinct) requested item in
the second cheapest slot, etc. The optimal competitive ratios for any online
algorithm are 1+H(n-1) ~ ln n for general costs and 2 for concave costs. For
logarithmic costs, the ratio is, asymptotically, 1: FCFS gives cost opt + O(log
opt).
For Huffman coding, FCFS yields an online algorithm (one that allocates
codewords on demand, without knowing the underlying probability distribution)
that guarantees asymptotically optimal cost: at most opt + 2 log(1+opt) + 2.
Comment: ACM-SIAM Symposium on Discrete Algorithms (SODA) 201
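The FCFS rule and the objective from the abstract are simple enough to state directly in code; the item names and probabilities below are a toy instance of our own choosing:

```python
def fcfs_allocate(requests, slot_costs):
    """First-Come-First-Served: the i-th *distinct* requested item gets the
    i-th cheapest vacant slot. Returns {item: slot_index}."""
    order = sorted(range(len(slot_costs)), key=lambda j: slot_costs[j])
    assignment = {}
    for item in requests:
        if item not in assignment:
            assignment[item] = order[len(assignment)]
    return assignment

def expected_cost(assignment, probs, slot_costs):
    """Objective: sum over items of p(item) * cost(assigned slot)."""
    return sum(probs[i] * slot_costs[j] for i, j in assignment.items())

slots = [1.0, 2.0, 4.0]          # known slot costs
reqs = ["b", "a", "b", "c"]      # i.i.d. draws from the hidden distribution
alloc = fcfs_allocate(reqs, slots)   # b -> slot 0, a -> slot 1, c -> slot 2
probs = {"a": 0.3, "b": 0.5, "c": 0.2}
cost = expected_cost(alloc, probs, slots)
```

On this instance FCFS happens to match the offline optimum, since the first arrival ("b") is also the most probable item; in general it only approximates it, within the competitive ratios stated above.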
Wallenius Naive Bayes
Traditional event models underlying naive Bayes classifiers assume probability distributions that are not appropriate for binary data generated by human behaviour. In this work, we develop a new event model based on a somewhat forgotten distribution created by Kenneth Ted Wallenius in 1963. We show that it achieves superior performance using less data on a collection of Facebook datasets, where the task is to predict personality traits based on likes.
Faculty of Applied Economics, University of Antwerp, Belgium; Department of Information, Operations & Management Sciences, NYU Stern School of Business
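The sampling process behind Wallenius's distribution is easy to simulate, which makes its bias intuitive: balls are drawn one at a time without replacement, with odds tilted by per-colour weights. The simulation below is a generic illustration of that process, not the classifier from the abstract:

```python
import random

def wallenius_draw(counts, weights, n, rng):
    """One Wallenius-style sample: draw n balls sequentially without
    replacement, each ball taken with probability proportional to the
    remaining mass (count * weight) of its colour. Returns per-colour
    tallies of the balls taken."""
    remaining = list(counts)
    taken = [0] * len(counts)
    for _ in range(n):
        mass = [c * w for c, w in zip(remaining, weights)]
        x = rng.uniform(0, sum(mass))
        for i, m in enumerate(mass):
            if x < m:
                remaining[i] -= 1
                taken[i] += 1
                break
            x -= m
    return taken

rng = random.Random(1)
# Two colours, 10 balls each; colour 0 is weighted 3x, so it dominates draws.
draws = [wallenius_draw([10, 10], [3.0, 1.0], 5, rng) for _ in range(2000)]
mean0 = sum(d[0] for d in draws) / len(draws)
```

The sequential depletion is what distinguishes Wallenius's distribution from Fisher's noncentral hypergeometric, where all draws compete simultaneously.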
Modelling Preference Data with the Wallenius Distribution
The Wallenius distribution is a generalisation of the Hypergeometric distribution in which weights are assigned to balls of different colours. This naturally defines a model for ranking categories which can be used for classification purposes. Since, in general, the resulting likelihood is not analytically available, we adopt an approximate Bayesian computation (ABC) approach for estimating the importance of the categories. We illustrate the performance of the estimation procedure on simulated datasets. Finally, we use the new model to analyse two datasets concerning movie ratings and Italian academic statisticians' journal preferences. The latter is a novel dataset collected by the authors.
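The ABC idea of the abstract can be sketched with the simplest rejection scheme: propose weights from a prior, simulate data, and keep proposals whose summaries land near the observed ones. Everything below (the stand-in choice model, the prior, the tolerance) is our own simplified illustration, not the paper's algorithm:

```python
import random

def simulate_top_choice(weights, rng):
    """Pick one category with probability proportional to its weight
    (a crude stand-in for the Wallenius ranking model)."""
    x = rng.uniform(0, sum(weights))
    for i, w in enumerate(weights):
        if x < w:
            return i
        x -= w
    return len(weights) - 1

def abc_estimate(observed_freq, n_draws=100, n_proposals=3000, tol=0.1, seed=0):
    """ABC rejection: propose weight vectors, simulate choice frequencies,
    keep proposals whose frequencies are within `tol` of the observed
    frequencies, and average the accepted weights."""
    rng = random.Random(seed)
    k = len(observed_freq)
    accepted = []
    for _ in range(n_proposals):
        w = [rng.random() for _ in range(k)]
        s = sum(w)
        w = [wi / s for wi in w]          # normalised proposal from the prior
        counts = [0] * k
        for _ in range(n_draws):
            counts[simulate_top_choice(w, rng)] += 1
        freq = [c / n_draws for c in counts]
        if max(abs(f - o) for f, o in zip(freq, observed_freq)) < tol:
            accepted.append(w)
    if not accepted:                       # degenerate fallback: uniform weights
        return [1.0 / k] * k
    return [sum(w[i] for w in accepted) / len(accepted) for i in range(k)]

est = abc_estimate([0.6, 0.3, 0.1])
```

Rejection ABC needs nothing but the ability to simulate from the model, which is exactly why it suits the Wallenius likelihood, whose normalising integral is not analytically available.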
Some Objects Are More Equal Than Others: Measuring and Predicting Importance
We observe that everyday images contain dozens of objects, and that humans, in describing these images, give different priority to these objects. We argue that a goal of visual recognition is, therefore, not only to detect and classify objects but also to associate with each a level of priority which we call 'importance'. We propose a definition of importance and show how this may be estimated reliably from data harvested from human observers. We conclude by showing that a first-order estimate of importance may be computed from a number of simple image region measurements and does not require access to image meaning