2,634 research outputs found
ACORA: Distribution-Based Aggregation for Relational Learning from Identifier Attributes
Feature construction through aggregation plays an essential role in modeling relational
domains with one-to-many relationships between tables. One-to-many relationships
lead to bags (multisets) of related entities, from which predictive information
must be captured. This paper focuses on aggregation from categorical attributes
that can take many values (e.g., object identifiers). We present a novel aggregation
method, part of the relational learning system ACORA, that combines the use of
vector distance and meta-data about the class-conditional distributions of attribute
values. We provide a theoretical foundation for this approach by deriving a "relational
fixed-effect" model within a Bayesian framework, and discuss the implications of
identifier aggregation on the expressive power of the induced model. One advantage
of using identifier attributes is the circumvention of limitations caused either by
missing/unobserved object properties or by independence assumptions. Finally, we
show empirically that the novel aggregators can generalize in the presence of identifier
(and other high-dimensional) attributes, and also explore the limitations of the
applicability of the methods.
Information Systems Working Papers Serie
Latent demographic profile estimation in hard-to-reach groups
The sampling frame in most social science surveys excludes members of certain
groups, known as hard-to-reach groups. These groups, or subpopulations, may be
difficult to access (e.g., the homeless), camouflaged by stigma (individuals
with HIV/AIDS), or both (commercial sex workers). Even basic demographic
information about these groups is typically unknown, especially in many
developing nations. We present statistical models which leverage social network
structure to estimate demographic characteristics of these subpopulations using
aggregated relational data (ARD), or questions of the form "How many X's do you
know?" Unlike other network-based techniques for reaching these groups, ARD
require no special sampling strategy and are easily incorporated into standard
surveys. ARD also do not require respondents to reveal their own group
membership. We propose a Bayesian hierarchical model for estimating the
demographic characteristics of hard-to-reach groups, or latent demographic
profiles, using ARD. We describe two estimation techniques. First, we propose a
Markov-chain Monte Carlo algorithm for existing data or cases where the full
posterior distribution is of interest. For cases when new data can be
collected, we offer guidelines and, based on these guidelines, a
simple estimate motivated by a missing data approach. Using data from McCarty
et al. [Human Organization 60 (2001) 28-39], we estimate the age and gender
profiles of six hard-to-reach groups, such as individuals who have HIV, women
who were raped, and homeless persons. We also evaluate our simple estimates
using simulation studies.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS569 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Distribution-based aggregation for relational learning with identifier attributes
Identifier attributes (very high-dimensional categorical attributes such as particular
product IDs or people’s names) are rarely incorporated in statistical modeling. However,
they can play an important role in relational modeling: it may be informative to have communicated
with a particular set of people or to have purchased a particular set of products. A
key limitation of existing relational modeling techniques is how they aggregate bags (multisets)
of values from related entities. The aggregations used by existing methods are simple
summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM,
or COUNT. This paper’s main contribution is the introduction of aggregation operators that
capture more information about the value distributions, by storing meta-data about value
distributions and referencing this meta-data when aggregating—for example by computing
class-conditional distributional distances. Such aggregations are particularly important for
aggregating values from high-dimensional categorical attributes, for which the simple aggregates
provide little information. In the first half of the paper we provide general guidelines
for designing aggregation operators, introduce the new aggregators in the context of the
relational learning system ACORA (Automated Construction of Relational Attributes), and
provide theoretical justification. We also conjecture special properties of identifier attributes,
e.g., they proxy for unobserved attributes and for information deeper in the relationship
network. In the second half of the paper we provide extensive empirical evidence that the
distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical
attributes, and in support of the aforementioned conjectures.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
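The contrast between simple aggregates (MEAN, MODE, SUM, COUNT) and a distribution-based aggregator can be illustrated with a minimal sketch. This is not the ACORA implementation: the function names, the choice of cosine distance, and the toy purchase data are all assumptions made for illustration. The sketch stores class-conditional value distributions estimated from training bags (the "meta-data") and turns a new bag of identifiers into one distance feature per class.

```python
from collections import Counter
import math


def class_conditional_distributions(bags, labels):
    """Estimate one distribution over identifier values per class label."""
    counts = {}
    for bag, label in zip(bags, labels):
        counts.setdefault(label, Counter()).update(bag)
    return {
        label: {value: n / sum(counter.values()) for value, n in counter.items()}
        for label, counter in counts.items()
    }


def cosine_distance(bag, reference):
    """Cosine distance between a bag's count vector and a reference distribution."""
    bag_counts = Counter(bag)
    dot = sum(n * reference.get(value, 0.0) for value, n in bag_counts.items())
    norm_bag = math.sqrt(sum(n * n for n in bag_counts.values()))
    norm_ref = math.sqrt(sum(p * p for p in reference.values()))
    if norm_bag == 0 or norm_ref == 0:
        return 1.0  # assumption: an empty bag is treated as maximally distant
    return 1.0 - dot / (norm_bag * norm_ref)


def aggregate(bag, distributions):
    """Map a bag of identifiers to one numeric feature per class."""
    return {label: cosine_distance(bag, dist) for label, dist in distributions.items()}


# Hypothetical toy data: bags of purchased product ids with a binary target.
train_bags = [["p1", "p2"], ["p1", "p3"], ["p4"], ["p4", "p5"]]
train_labels = [1, 1, 0, 0]

dists = class_conditional_distributions(train_bags, train_labels)
features = aggregate(["p1", "p2"], dists)
print(features)
```

In this toy case the bag is closer to the class-1 distribution than to the class-0 distribution, whereas COUNT would return 2 for both training classes' typical bags and carry no class signal. In a real setting, the reference distributions would need to be estimated without the target instance's own label to avoid leakage.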
Hierarchical Models for Relational Event Sequences
Interaction within small groups can often be represented as a sequence of
events, where each event involves a sender and a recipient. Recent methods for
modeling network data in continuous time model the rate at which individuals
interact conditioned on the previous history of events as well as actor
covariates. We present a hierarchical extension for modeling multiple such
sequences, facilitating inferences about event-level dynamics and their
variation across sequences. The hierarchical approach allows one to share
information across sequences in a principled manner; we illustrate the
efficacy of such sharing through a set of prediction experiments. After
discussing methods for adequacy checking and model selection for this class of
models, we illustrate the method with an analysis of high school classroom
dynamics.
On the estimation of a fixed effects model with selective non-response
Economics; Statistical Methods; econometrics