Linking Datasets on Organizations Using Half A Billion Open Collaborated Records
Scholars studying organizations often work with multiple datasets lacking
shared unique identifiers or covariates. In such situations, researchers may
turn to approximate string matching methods to combine datasets. String
matching, although useful, faces fundamental challenges. Even when two strings
appear similar to humans, fuzzy matching often does not work because it fails
to adapt to the informativeness of the character combinations presented. Worse,
many entities have multiple names that are dissimilar (e.g., "Fannie Mae" and
"Federal National Mortgage Association"), a case where string matching has
little hope of succeeding. This paper introduces data from a prominent
employment-related networking site (LinkedIn) as a tool to address these
problems. We propose interconnected approaches to leveraging the massive amount
of information from LinkedIn regarding organizational name-to-name links. The
first approach builds a machine learning model for predicting matches from
character strings, treating the trillions of user-contributed organizational
name pairs as a training corpus: this approach constructs a string matching
metric that explicitly maximizes match probabilities. A second approach
identifies relationships between organization names using network
representations of the LinkedIn data. A third approach combines the first and
second. We document substantial improvements over fuzzy matching in
applications, making all methods accessible in open-source software
("LinkOrgs").
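
The alias problem described in this abstract can be illustrated with a minimal sketch. It does not use LinkOrgs or LinkedIn data; it relies only on Python's standard-library `difflib.SequenceMatcher` as a stand-in for a generic fuzzy matcher, and the helper name `similarity` and the "Goldman Sachs" typo pair are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two names for the same organization that share few characters.
alias_score = similarity("Fannie Mae", "Federal National Mortgage Association")

# A hypothetical near-duplicate pair that differs by one dropped letter.
typo_score = similarity("Goldman Sachs", "Goldman Sach")

print(f"alias pair: {alias_score:.2f}")
print(f"typo pair:  {typo_score:.2f}")
```

A character-based score ranks the typo pair far above the alias pair, even though both pairs refer to a single organization. This is the gap that the name-to-name links described above are meant to close.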
From which world is your graph?
Discovering statistical structure from links is a fundamental problem in the
analysis of social networks. Choosing a misspecified model or, equivalently, an
incorrect inference algorithm will result in an invalid analysis or may even
falsely uncover patterns that are in fact artifacts of the model. This work
focuses on unifying two of the most widely used link-formation models: the
stochastic blockmodel (SBM) and the small world (or latent space) model (SWM).
Integrating techniques from kernel learning, spectral graph theory, and
nonlinear dimensionality reduction, we develop the first statistically sound
polynomial-time algorithm to discover latent patterns in sparse graphs for both
models. When the network comes from an SBM, the algorithm outputs a block
structure. When it is from an SWM, the algorithm outputs estimates of each
node's latent position. Comment: To appear in NIPS 201
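
As a rough illustration of the first of the two models named in this abstract, the sketch below samples a graph from a stochastic blockmodel, in which the probability of an edge depends only on the blocks of its two endpoints. It is not the paper's inference algorithm. The function name `sample_sbm` and the parameters `p_in`, `p_out`, and `seed` are assumptions made for this example.

```python
import random


def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Sample an undirected graph from a simple stochastic blockmodel.

    Nodes in the same block are linked with probability p_in,
    nodes in different blocks with probability p_out.
    Returns (labels, edges), where edges is a set of (i, j) with i < j.
    """
    rng = random.Random(seed)
    # Assign block labels: the first block_sizes[0] nodes get label 0, and so on.
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges


# Hypothetical parameters: two blocks of 20 nodes, dense within, sparse between.
labels, edges = sample_sbm([20, 20], p_in=0.5, p_out=0.05)
within = sum(1 for i, j in edges if labels[i] == labels[j])
between = len(edges) - within
print(f"within-block edges: {within}, between-block edges: {between}")
```

With these settings most edges fall inside a block, which is the kind of block structure the abstract says the algorithm outputs when the network comes from an SBM.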