Learning a mixture of two multinomial logits
The classical Multinomial Logit (MNL) is a behavioral model for user choice. In this model, a user is offered a slate of choices (a subset of a finite universe of n items), and selects exactly one item from the slate, each with probability proportional to its (positive) weight. Given a set of observed slates and choices, the likelihood-maximizing item weights are easy to learn at scale, and easy to interpret. However, the model fails to represent common real-world behavior. As a result, researchers in user choice often turn to mixtures of MNLs, which are known to approximate a large class of models of rational user behavior. Unfortunately, the only known algorithms for this problem have been heuristic in nature. In this paper we give the first polynomial-time algorithms for exact learning of uniform mixtures of two MNLs. Interestingly, the parameters of the model can be learned for any n by sampling the behavior of random users only on slates of sizes 2 and 3; in contrast, we show that slates of size 2 are insufficient by themselves
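The choice rule described in this abstract can be illustrated with a minimal sketch. It assumes item weights are stored in plain dictionaries; the function names and the example weights are hypothetical and are not taken from the paper. It only evaluates choice probabilities for a given slate and does not implement the learning algorithm.

```python
def mnl_probs(weights, slate):
    """Choice probabilities of a single MNL: proportional to each item's positive weight."""
    total = sum(weights[i] for i in slate)
    return {i: weights[i] / total for i in slate}


def uniform_mixture_probs(weights_a, weights_b, slate):
    """Uniform mixture of two MNLs: each component is used with probability 1/2."""
    pa = mnl_probs(weights_a, slate)
    pb = mnl_probs(weights_b, slate)
    return {i: 0.5 * (pa[i] + pb[i]) for i in slate}


# Hypothetical weights over a universe of three items.
w_a = {0: 1.0, 1: 2.0, 2: 1.0}
w_b = {0: 3.0, 1: 1.0, 2: 4.0}

# Choice distribution on a slate of size 2.
probs = uniform_mixture_probs(w_a, w_b, [0, 1])
print(probs)
```

On slates of size 2 and 3, distributions of this form are what a learner would observe by sampling random users.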
On the power laws of language: word frequency distributions
About eight decades ago, Zipf postulated that the word frequency distribution of languages is a power law, i.e., it is a straight line on a log-log plot. Over the years, this phenomenon has been documented and studied extensively. For many corpora, however, the empirical distribution barely resembles a power law: when plotted on a log-log scale, the distribution is concave and appears to be composed of two differently sloped straight lines joined by a smooth curve. A simple generative model is proposed to capture this phenomenon. The word frequency distributions produced by this model are shown to match the observations both analytically and empirically. © 2017 Copyright held by the owner/author(s)
Discrete choice, permutations, and reconstruction
In this paper we study the well-known family of Random Utility Models, developed over 50 years ago to codify rational user behavior in choosing one item from a finite set of options. In this setting each user draws i.i.d. from some distribution a utility function mapping each item in the universe to a real-valued utility. The user is then offered a subset of the items, and selects the one of maximum utility. A Max-Dist oracle for this choice model takes any subset of items and returns the probability (over the distribution of utility functions) that each will be selected. A discrete choice algorithm, given access to a Max-Dist oracle, must return a function that approximates the oracle. We show three primary results. First, we show that any algorithm exactly reproducing the oracle must make exponentially many queries. Second, we show an equivalent representation of the distribution over utility functions, based on permutations, and show that if this distribution has support size k, then it is possible to approximate the oracle using O(nk) queries. Finally, we consider settings in which the subset of items is always small. We give an algorithm that makes less than $n^{(1-\epsilon/2)K}$ queries, each to sets of size at most $(1-\epsilon/2)K$, in order to approximate the Max-Dist oracle on every set of size $|T| \le K$ with statistical error at most $\epsilon$. In contrast, we show that any algorithm that queries for subsets of size $2^{O(\sqrt{\log n})}$ must make maximal statistical error on some large sets.
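To make the permutation-based representation concrete, here is a small sketch of a Max-Dist oracle for a distribution with finite support over permutations. The data layout (a list of `(permutation, probability)` pairs, with each permutation ordered from most to least preferred) and the function name `max_dist` are assumptions made for illustration; the paper's algorithms query such an oracle rather than construct one.

```python
def max_dist(perm_dist, slate):
    """Return, for each item in the slate, the probability that it is selected.

    perm_dist: list of (permutation, probability) pairs; each permutation
               lists all items from most preferred to least preferred.
    slate:     the subset of items offered to the user.
    """
    offered = set(slate)
    selected = {item: 0.0 for item in offered}
    for perm, prob in perm_dist:
        # The user picks the offered item that ranks highest in their permutation.
        winner = next(item for item in perm if item in offered)
        selected[winner] += prob
    return selected


# Hypothetical distribution with support size 3 over the items a, b, c.
dist = [(("a", "b", "c"), 0.5), (("c", "a", "b"), 0.3), (("b", "c", "a"), 0.2)]
result = max_dist(dist, ["a", "b"])
print(result)
```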
Descortesía en las páginas de Facebook de festivales de música (Impoliteness on the Facebook pages of music festivals)
This article focuses on the interactions that take place on Facebook (FB), the well-known social networking site that also serves as a resource for communication and tourism promotion. After characterizing this specific sociocultural context, which comprises the behaviors, attitudes and values known, accepted and practiced in a discourse community, the article describes the phenomenon of impoliteness in a limited corpus of Facebook pages of music festivals, offering some reflections on its features and functions.
Fair Clustering Through Fairlets
We study the question of fair clustering under the disparate impact
doctrine, where each protected class must have approximately equal
representation in every cluster. We formulate the fair clustering problem under
both the $k$-center and the $k$-median objectives, and show that even with two
protected classes the problem is challenging, as the optimum solution can
violate common conventions; for instance, a point may no longer be assigned to
its nearest cluster center! En route we introduce the concept of fairlets,
which are minimal sets that satisfy fair representation while approximately
preserving the clustering objective. We show that any fair clustering problem
can be decomposed into first finding good fairlets, and then using existing
machinery for traditional clustering algorithms. While finding good fairlets
can be NP-hard, we proceed to obtain efficient approximation algorithms based
on minimum cost flow. We empirically quantify the value of fair clustering on
real-world datasets with sensitive attributes
Motif counting beyond five nodes
Counting graphlets is a well-studied problem in graph mining and social network analysis. Recently, several papers explored very simple and natural algorithms based on Monte Carlo sampling of Markov Chains (MC), and reported encouraging results. We show, perhaps surprisingly, that such algorithms are outperformed by color coding (CC) [2], a sophisticated algorithmic technique that we extend to the case of graphlet sampling and for which we prove strong statistical guarantees. Our computational experiments on graphs with millions of nodes show CC to be more accurate than MC; furthermore, we formally show that the mixing time of the MC approach is too high in general, even when the input graph has high conductance. All this comes at a price however. While MC is very efficient in terms of space, CC’s memory requirements become demanding when the size of the input graph and that of the graphlets grow. And yet, our experiments show that CC can push the limits of the state-of-the-art, both in terms of the size of the input graph and of that of the graphlets
On sampling nodes in a network
Random walk is an important tool in many graph mining applications including estimating graph parameters, sampling portions of the graph, and extracting dense communities. In this paper we consider the problem of sampling nodes from a large graph according to a prescribed distribution by using random walk as the basic primitive. Our goal is to obtain algorithms that make a small number of queries to the graph but output a node that is sampled according to the prescribed distribution. Focusing on the uniform distribution case, we study the query complexity of three algorithms and show a near-tight bound expressed in terms of the parameters of the graph such as average degree and the mixing time. Both theoretically and empirically, we show that some algorithms are preferable in practice to the others. We also extend our study to the problem of sampling nodes according to some polynomial function of their degrees; this has implications for designing efficient algorithms for applications such as triangle counting.
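As an illustration of sampling nodes near-uniformly with a random walk, the sketch below uses the standard Metropolis-Hastings correction, which makes the walk's stationary distribution uniform on a connected graph. The abstract does not name its three algorithms, so this is not necessarily one of them; the function name, the neighbor-query interface, and the example graph are assumptions.

```python
import random


def mh_uniform_sample(neighbors, start, steps, rng):
    """Run a Metropolis-Hastings random walk and return the final node.

    neighbors: function mapping a node to the list of its neighbors
               (each call models one query to the graph).
    start:     node where the walk begins.
    steps:     number of walk steps; should exceed the mixing time.
    rng:       a random.Random instance.
    """
    node = start
    for _ in range(steps):
        nbrs = neighbors(node)
        candidate = rng.choice(nbrs)
        # Accept with probability min(1, deg(node) / deg(candidate)); this
        # removes the plain walk's bias toward high-degree nodes.
        if rng.random() < len(nbrs) / len(neighbors(candidate)):
            node = candidate
    return node


# Hypothetical small graph given as adjacency lists.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
sample = mh_uniform_sample(adj.__getitem__, start=0, steps=30, rng=random.Random(0))
print(sample)
```

A plain random walk would instead return nodes with probability proportional to their degree, which is why a correction step (or rejection sampling) is needed for the uniform case.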
Voting with Limited Information and Many Alternatives
The traditional axiomatic approach to voting is motivated by the problem of
reconciling differences in subjective preferences. In contrast, a dominant line
of work in the theory of voting over the past 15 years has considered a
different kind of scenario, also fundamental to voting, in which there is a
genuinely "best" outcome that voters would agree on if they only had enough
information. This type of scenario has its roots in the classical Condorcet
Jury Theorem; it includes cases such as jurors in a criminal trial who all want
to reach the correct verdict but disagree in their inferences from the
available evidence, or a corporate board of directors who all want to improve
the company's revenue, but who have different information that favors different
options.
This style of voting leads to a natural set of questions: each voter has a
private signal that provides probabilistic information about which option
is best, and a central question is whether a simple plurality voting system,
which tabulates votes for different options, can cause the group decision to
arrive at the correct option. We show that plurality voting is powerful enough
to achieve this: there is a way for voters to map their signals into votes for
options in such a way that, with sufficiently many voters, the correct
option receives the greatest number of votes with high probability. We show
further, however, that any process for achieving this is inherently expensive
in the number of voters it requires: succeeding in identifying the correct
option with probability at least $1 - \eta$ requires $\Omega(n^3 \epsilon^{-2} \log \eta^{-1})$ voters, where $n$ is the number of options and $\epsilon$ is a
distributional measure of the minimum difference between the options