'International Journal of Machine Learning and Networked Collaborative Engineering'
Abstract
Identifier attributes—very high-dimensional categorical attributes such as particular
product ids or people’s names—rarely are incorporated in statistical modeling. However,
they can play an important role in relational modeling: it may be informative to have communicated
with a particular set of people or to have purchased a particular set of products. A
key limitation of existing relational modeling techniques is how they aggregate bags (multisets)
of values from related entities. The aggregations used by existing methods are simple
summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM,
or COUNT. This paper’s main contribution is the introduction of aggregation operators that
capture more information about the value distributions, by storing meta-data about value
distributions and referencing this meta-data when aggregating—for example by computing
class-conditional distributional distances. Such aggregations are particularly important for
aggregating values from high-dimensional categorical attributes, for which the simple aggregates
provide little information. In the first half of the paper we provide general guidelines
for designing aggregation operators, introduce the new aggregators in the context of the
relational learning system ACORA (Automated Construction of Relational Attributes), and
provide theoretical justification.We also conjecture special properties of identifier attributes,
e.g., they proxy for unobserved attributes and for information deeper in the relationship
network. In the second half of the paper we provide extensive empirical evidence that the
distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical
attributes, and in support of the aforementioned conjectures.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc