Krylov Subspace Approximation for Local Community Detection in Large Networks
Community detection is an important information mining task to uncover
modular structures in large networks. For increasingly common large network
data sets, global community detection is prohibitively expensive, and attention
has shifted to methods that mine local communities, i.e. identifying all latent
members of a particular community from a few labeled seed members. To address
this semi-supervised mining task, we systematically develop a local spectral
subspace-based community detection method, called LOSP. We define a family of
local spectral subspaces based on Krylov subspaces, and seek a sparse indicator
for the target community via an ℓ1-norm minimization over the Krylov
subspace. Variants of LOSP depend on the type of random walk and its diffusion
speed, the dimension of the local spectral subspace, and the number of
diffusion steps. The effectiveness of the proposed LOSP
approach is theoretically analyzed based on Rayleigh quotients, and it is
experimentally verified on a wide variety of real-world networks across social,
production and biological domains, as well as on an extensive set of synthetic
LFR benchmark datasets.
Comment: Submitted to ACM Transactions on Knowledge Discovery from Data (under
revision)
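The Krylov construction mentioned in the abstract above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' LOSP code: the graph is an undirected, unweighted adjacency dictionary with no isolated nodes, the walk is the standard random walk P = D^-1 A, and the names `walk_step`, `krylov_walk_vectors`, and `toy_graph` are hypothetical. The ℓ1-norm minimization over the resulting subspace is not shown.

```python
def walk_step(adj, dist):
    """One random-walk step: returns dist * P with P = D^-1 A (row vector)."""
    nxt = {}
    for u, mass in dist.items():
        share = mass / len(adj[u])  # split the mass evenly among neighbours
        for v in adj[u]:
            nxt[v] = nxt.get(v, 0.0) + share
    return nxt


def krylov_walk_vectors(adj, seeds, dim):
    """Return [p0, p0 P, ..., p0 P^(dim-1)], whose span is a Krylov subspace.

    p0 is the uniform distribution on the seed set. Vectors are stored as
    sparse dicts, so only nodes reachable within dim-1 steps are touched.
    """
    p = {s: 1.0 / len(seeds) for s in seeds}
    vectors = [p]
    for _ in range(dim - 1):
        p = walk_step(adj, p)
        vectors.append(p)
    return vectors


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
vectors = krylov_walk_vectors(toy_graph, seeds=[0], dim=3)
```

Because each vector is supported only on nodes within a few hops of the seeds, the cost of building the subspace depends on the neighbourhood explored rather than on the size of the whole graph.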
The Interhospital Transfer Network for Very Low Birth Weight Infants in the United States
Very low birth weight (VLBW) infants require specialized care in neonatal
intensive care units. In the United States (U.S.), such infants frequently are
transferred between hospitals. Although these neonatal transfer networks are
important, both economically and for infant morbidity and mortality, the
national-level pattern of neonatal transfers is largely unknown. Using data
from Vermont Oxford Network on 44,753 births, 2,122 hospitals, and 9,722
inter-hospital infant transfers from 2015, we performed the largest analysis to
date on the inter-hospital transfer network for VLBW infants in the U.S. We
find that transfers are organized around regional communities, but that despite
being largely within state boundaries, most communities often contain at least
two hospitals in different states. To classify the structural variation in
transfer pattern amongst these communities, we applied a spectral measure for
regionalization and found an association between a community's degree of
regionalization and its infant transfer rate, which was not used in
detecting communities. We also demonstrate that the established measures of
network centrality and hierarchy, e.g., the community-wide entropy in PageRank
or betweenness centrality and the number of distinct 'layers' within a community,
correlate weakly with our regionalization index and were not significantly
associated with metrics on infant transfer rate. Our results suggest that the
regionalization index captures novel information about the structural
properties of VLBW infant transfer networks, have the practical implication of
characterizing neonatal care in the U.S., and may apply more broadly to the
role of centralizing forces in organizing complex adaptive systems.
Overlapping Community Detection Using Neighborhood-Inflated Seed Expansion
Community detection is an important task in network analysis. A community
(also referred to as a cluster) is a set of cohesive vertices that have more
connections inside the set than outside. In many social and information
networks, these communities naturally overlap. For instance, in a social
network, each vertex in a graph corresponds to an individual who usually
participates in multiple communities. In this paper, we propose an efficient
overlapping community detection algorithm using a seed expansion approach. The
key idea of our algorithm is to find good seeds, and then greedily expand these
seeds based on a community metric. Within this seed expansion method, we
investigate the problem of how to determine good seed nodes in a graph. In
particular, we develop new seeding strategies for a personalized PageRank
clustering scheme that optimizes the conductance community score. Experimental
results show that our seed expansion algorithm outperforms other
state-of-the-art overlapping community detection methods in terms of producing
cohesive clusters and identifying ground-truth communities. We also show that
our new seeding strategies are better than existing strategies, and are thus
effective in finding good overlapping communities in real-world networks.
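The seed-expansion idea in the abstract above (start from a seed, then greedily grow the set while a community metric improves) can be sketched with conductance as the metric. This is a simplified stand-in, not the paper's algorithm, which expands seeds with a personalized PageRank clustering scheme; the function names and the toy graph are hypothetical, and the graph is assumed undirected and unweighted.

```python
def conductance(adj, members):
    """cut(S) / min(vol(S), vol(V \\ S)) for an undirected, unweighted graph."""
    members = set(members)
    vol_s = sum(len(adj[u]) for u in members)
    vol_total = sum(len(nbrs) for nbrs in adj.values())
    cut = sum(1 for u in members for v in adj[u] if v not in members)
    denom = min(vol_s, vol_total - vol_s)
    return cut / denom if denom > 0 else 1.0


def greedy_expand(adj, seed):
    """Grow a set from one seed, adding the neighbour that lowers conductance most."""
    community = {seed}
    best = conductance(adj, community)
    while True:
        frontier = {v for u in community for v in adj[u]} - community
        if not frontier:
            break
        score, node = min((conductance(adj, community | {v}), v) for v in frontier)
        if score >= best:  # no neighbour improves the score: stop
            break
        community.add(node)
        best = score
    return community, best


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
community, score = greedy_expand(toy_graph, seed=0)
```

On this toy graph the expansion stops once the seed's triangle is recovered, because adding a node across the bridge would raise the conductance again.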
A Short Introduction to Local Graph Clustering Methods and Software
Graph clustering has many important applications in computing, but due to the
increasing sizes of graphs, even traditionally fast clustering methods can be
computationally expensive for real-world graphs of interest. Scalability
problems led to the development of local graph clustering algorithms that come
with a variety of theoretical guarantees. Rather than return a global
clustering of the entire graph, local clustering algorithms return a single
cluster around a given seed node or set of seed nodes. These algorithms improve
scalability because they use time and memory resources that depend only on the
size of the cluster returned, instead of the size of the input graph. Indeed,
for many of them, their running time grows linearly with the size of the
output. In addition to scalability arguments, local graph clustering algorithms
have proven to be very useful for identifying and interpreting small-scale and
meso-scale structure in large-scale graphs. As opposed to heuristic operational
procedures, this class of algorithms comes with strong algorithmic and
statistical theory. These include statistical guarantees that prove they have
implicit regularization properties. One of the challenges with the existing
literature on these approaches is that they are published in a wide variety of
areas, including theoretical computer science, statistics, data science, and
mathematics. This has made it difficult to relate the various algorithms and
ideas together into a cohesive whole. We have recently been working on unifying
these diverse perspectives through the lens of optimization as well as
providing software to perform these computations in a cohesive fashion. In this
note, we provide a brief introduction to local graph clustering, we provide
some representative examples of our perspective, and we introduce our software
named Local Graph Clustering (LGC).
Comment: 3 pages, 2 figures
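To illustrate why a local method's cost can depend on the output rather than on the whole graph, here is a minimal sketch of a push-style approximation to personalized PageRank. It is not taken from the LGC software and does not use its API; the function name, parameters, and toy graph are assumptions made for this example, and the non-lazy push rule shown is one common variant among several.

```python
from collections import deque


def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-style approximate personalized PageRank from a single seed.

    p holds the approximation and r the residual mass. A node is pushed
    while its residual is at least eps * degree, so only nodes near the
    seed are ever touched. Invariant: sum(p) + sum(r) == 1.
    """
    p = {}
    r = {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        ru = r.get(u, 0.0)
        if ru < eps * deg_u:
            continue
        p[u] = p.get(u, 0.0) + alpha * ru   # keep a fraction alpha at u
        r[u] = 0.0
        share = (1.0 - alpha) * ru / deg_u  # spread the rest to neighbours
        for v in adj[u]:
            before = r.get(v, 0.0)
            r[v] = before + share
            threshold = eps * len(adj[v])
            if before < threshold <= r[v]:  # v just became pushable
                queue.append(v)
    return p, r


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
p, r = approximate_ppr(toy_graph, seed=0, alpha=0.5, eps=1e-4)
```

Each push moves at least `alpha * eps * degree` of mass into `p`, which bounds the total work independently of the number of nodes in the graph. A cluster is then typically read off by sorting nodes by `p[u] / degree(u)` and taking the prefix with the lowest conductance.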
Parallel Local Graph Clustering
Graph clustering has many important applications in computing, but due to
growing sizes of graphs, even traditionally fast clustering methods such as
spectral partitioning can be computationally expensive for real-world graphs of
interest. Motivated partly by this, so-called local algorithms for graph
clustering have received significant interest due to the fact that they can
find good clusters in a graph with work proportional to the size of the cluster
rather than that of the entire graph. This feature has proven to be crucial in
making such graph clustering and many of its downstream applications efficient
in practice. While local clustering algorithms are already faster than
traditional algorithms that touch the entire graph, they are sequential and
there is an opportunity to make them even more efficient via parallelization.
In this paper, we show how to parallelize many of these algorithms in the
shared-memory multicore setting, and we analyze the parallel complexity of
these algorithms. We present comprehensive experiments on large-scale graphs
showing that our parallel algorithms achieve good parallel speedups on a modern
multicore machine, thus significantly speeding up the analysis of local graph
clusters in the very large-scale setting.
Comment: Fixed typo in Figure
Overlapping Community Detection via Local Spectral Clustering
Large graphs arise in a number of contexts and understanding their structure
and extracting information from them is an important research area. Early
algorithms for mining communities focused on global structure, and their
running time often grows with the size of the entire graph. Nowadays, as we
often explore networks with billions of vertices and find communities with
only hundreds of members, it is crucial to shift our attention from macroscopic structure to
microscopic structure in large networks. A growing body of work has been
adopting local expansion methods in order to identify the community members
from a few exemplary seed members.
In this paper, we propose a novel approach for finding overlapping
communities called LEMON (Local Expansion via Minimum One Norm). The algorithm
finds the community by seeking a sparse vector in the span of the local spectra
such that the seeds are in its support. We show that LEMON can achieve the
highest detection accuracy among state-of-the-art proposals. The running time
depends on the size of the community rather than that of the entire graph. The
algorithm is easy to implement, and is highly parallelizable. We further
provide theoretical analysis on the local spectral properties, bounding the
measure of tightness of extracted community in terms of the eigenvalues of
graph Laplacian.
Moreover, given that networks are not all similar in nature, a comprehensive
analysis on how the local expansion approach is suited for uncovering
communities in different networks is still lacking. We thoroughly evaluate our
approach using both synthetic and real-world datasets across different domains,
and analyze the empirical variations when applying our method to inherently
different networks in practice. In addition, the heuristics on how the seed set
quality and quantity would affect the performance are provided.
Comment: Extended version to the conference proceeding in WWW'1
Heat kernel based community detection
The heat kernel is a particular type of graph diffusion that, like the
much-used personalized PageRank diffusion, is useful in identifying a community
near a starting seed node. We present the first deterministic, local
algorithm to compute this diffusion and use that algorithm to study the
communities that it produces. Our algorithm is formally a relaxation method for
solving a linear system to estimate the matrix exponential in a degree-weighted
norm. We prove that this algorithm stays localized in a large graph and has a
worst-case constant runtime that depends only on the parameters of the
diffusion, not the size of the graph. Our experiments on real-world networks
indicate that the communities produced by this method have better conductance
than those produced by PageRank, although they take slightly longer to compute
on large graphs. On a real-world community identification task, the heat kernel
communities perform better than those from the PageRank diffusion.
Comment: 10 pages, published in KDD2014 proceedings; Contains minor correction
to experiments from original version
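For readers unfamiliar with the heat kernel diffusion named in the abstract above, the quantity being approximated is h = e^{-t} Σ_k (t^k / k!) s P^k, where s is the seed indicator and P the random-walk matrix. The sketch below evaluates a truncated version of this series directly. It is deliberately not the paper's local relaxation method; it is a plain Taylor-series evaluation under the assumption of an undirected, unweighted graph, with hypothetical names throughout.

```python
import math


def heat_kernel(adj, seed, t=3.0, terms=20):
    """Truncated Taylor series for the heat kernel diffusion from one seed.

    Computes e^-t * sum_{k < terms} (t^k / k!) * s P^k, where s is the seed
    indicator (row vector) and P = D^-1 A is the random-walk matrix.
    """
    walk = {seed: 1.0}       # s P^k, starting at k = 0
    weight = math.exp(-t)    # e^-t * t^k / k! for k = 0
    h = {}
    for k in range(terms):
        for u, mass in walk.items():
            h[u] = h.get(u, 0.0) + weight * mass
        nxt = {}
        for u, mass in walk.items():   # advance the walk by one step
            share = mass / len(adj[u])
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        walk = nxt
        weight *= t / (k + 1)          # move to the next Poisson weight
    return h


# Hypothetical toy graph: two triangles joined by the edge 2-3.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
h = heat_kernel(toy_graph, seed=0, t=1.0, terms=20)
```

The series weights are Poisson probabilities, so they decay much faster than the geometric weights of personalized PageRank; the parameter `t` controls how far the diffusion spreads from the seed.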
Leveraging local network communities to predict academic performance
For more than 20 years, social network analysis of student collaboration
networks has focused on a student's centrality to predict academic performance.
Even though a growing body of sociological literature supports the idea that
academic success is contagious, identifying central students in the network
alone does not capture how peer interactions facilitate the spread of academic
success throughout the network. Consequently, we propose novel predictors that
treat academic success as a contagion by identifying a student's learning
community, consisting of the peers that are most likely to influence a
student's performance in a course. We evaluate the importance of these learning
communities by predicting academic outcomes in an introductory college
statistics course with 103 students. In particular, we observe that by
including these learning community predictors, the resulting model is 68 times
more likely to be the correct model than the current state-of-the-art
centrality network models in the literature.
Comment: 12 pages, 5 figures
Inferring Fine-grained Details on User Activities and Home Location from Social Media: Detecting Drinking-While-Tweeting Patterns in Communities
Nearly all previous work on geo-locating latent states and activities from
social media confounds general discussions about activities, self-reports of
users participating in those activities at times in the past or future, and
self-reports made at the immediate time and place the activity occurs.
Activities, such as alcohol consumption, may occur at different places and
types of places, and it is important not only to detect the local regions where
these activities occur, but also to analyze the degree of participation in them
by local residents. In this paper, we develop new machine learning based
methods for fine-grained localization of activities and home locations from
Twitter data. We apply these methods to discover and compare alcohol
consumption patterns in a large urban area, New York City, and a more suburban
and rural area, Monroe County. We find positive correlations between the rate
of alcohol consumption reported among a community's Twitter users and the
density of alcohol outlets, demonstrating that the degree of correlation varies
significantly between urban and suburban areas. While our experiments are
focused on alcohol use, our methods for locating homes and distinguishing
temporally-specific self-reports are applicable to a broad range of behaviors
and latent states.
Comment: 12 pages, 7 figures, 4-page poster version accepted at ICWSM 2016,
alcohol dataset and keywords available in:
cs.rochester.edu/u/nhossain/icwsm-16-data.zi
A Local Spectral Method for Graphs: with Applications to Improving Graph Partitions and Exploring Data Graphs Locally
The second eigenvalue of the Laplacian matrix and its associated eigenvector
are fundamental features of an undirected graph, and as such they have found
widespread use in scientific computing, machine learning, and data analysis. In
many applications, however, graphs that arise have several local regions
of interest, and the second eigenvector will typically fail to provide
information fine-tuned to each local region. In this paper, we introduce a
locally-biased analogue of the second eigenvector, and we demonstrate its
usefulness at highlighting local properties of data graphs in a semi-supervised
manner. To do so, we first view the second eigenvector as the solution to a
constrained optimization problem, and we incorporate the local information as
an additional constraint; we then characterize the optimal solution to this new
problem and show that it can be interpreted as a generalization of a
Personalized PageRank vector; and finally, as a consequence, we show that the
solution can be computed in nearly-linear time. In addition, we show that this
locally-biased vector can be used to compute an approximation to the best
partition near an input seed set in a manner analogous to the way in
which the second eigenvector of the Laplacian can be used to obtain an
approximation to the best partition in the entire input graph. Such a primitive
is useful for identifying and refining clusters locally, as it allows us to
focus on a local region of interest in a semi-supervised manner. Finally, we
provide a detailed empirical evaluation of our method by showing how it can
be applied to finding locally-biased sparse cuts around an input vertex seed set
in social and information networks.
Comment: 24 pages. Completely rewritten; substance is still the same, but the
presentation is reworked