    Krylov Subspace Approximation for Local Community Detection in Large Networks

    Community detection is an important information mining task to uncover modular structures in large networks. For increasingly common large network data sets, global community detection is prohibitively expensive, and attention has shifted to methods that mine local communities, i.e., methods that identify all latent members of a particular community from a few labeled seed members. To address this semi-supervised mining task, we systematically develop a local spectral subspace-based community detection method, called LOSP. We define a family of local spectral subspaces based on Krylov subspaces, and seek a sparse indicator for the target community via an $\ell_1$ norm minimization over the Krylov subspace. Variants of LOSP depend on the type of random walk (with different diffusion speeds), the dimension of the local spectral subspace, and the number of diffusion steps. The effectiveness of the proposed LOSP approach is theoretically analyzed based on Rayleigh quotients, and it is experimentally verified on a wide variety of real-world networks across social, production and biological domains, as well as on an extensive set of synthetic LFR benchmark datasets.
    Comment: Submitted to ACM Transactions on Knowledge Discovery from Data (under revision)
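
    As a rough illustration of the two ingredients named in this abstract, a Krylov subspace and a sparse indicator found by $\ell_1$ minimization can be written as below. The symbols $P$ (a random-walk transition matrix), $p_0$ (a starting distribution on the seed set $S$), $d$ (the subspace dimension) and $V$ (a basis matrix) are assumptions introduced here for illustration; the exact formulation used by LOSP may differ.

```latex
% Krylov subspace of dimension d generated by P and p_0
\[
  \mathcal{K}_d(P, p_0) = \operatorname{span}\{\, p_0,\; P p_0,\; P^2 p_0,\; \dots,\; P^{d-1} p_0 \,\}
\]
% One plausible form of the sparse-indicator search: V is a matrix whose
% columns span K_d(P, p_0); y is the candidate membership vector.
\[
  \min_{x,\,y} \; \lVert y \rVert_1
  \quad \text{subject to} \quad
  y = V x, \qquad y \ge 0, \qquad y_i \ge 1 \;\; \text{for all } i \in S
\]
```

    Because $y \ge 0$, the objective reduces to the sum of the entries of $y$, so a problem of this form is a linear program.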

    The Interhospital Transfer Network for Very Low Birth Weight Infants in the United States

    Very low birth weight (VLBW) infants require specialized care in neonatal intensive care units. In the United States (U.S.), such infants are frequently transferred between hospitals. Although these neonatal transfer networks are important, both economically and for infant morbidity and mortality, the national-level pattern of neonatal transfers is largely unknown. Using data from Vermont Oxford Network on 44,753 births, 2,122 hospitals, and 9,722 inter-hospital infant transfers from 2015, we performed the largest analysis to date on the inter-hospital transfer network for VLBW infants in the U.S. We find that transfers are organized around regional communities, but that despite being largely within state boundaries, most communities contain at least two hospitals in different states. To classify the structural variation in transfer patterns among these communities, we applied a spectral measure for regionalization and found an association between a community's degree of regionalization and its infant transfer rate, which was not used in detecting communities. We also demonstrate that the established measures of network centrality and hierarchy, e.g., the community-wide entropy in PageRank or betweenness centrality and the number of distinct 'layers' within a community, correlate weakly with our regionalization index and were not significantly associated with metrics on infant transfer rate. Our results suggest that the regionalization index captures novel information about the structural properties of VLBW infant transfer networks, have the practical implication of characterizing neonatal care in the U.S., and may apply more broadly to the role of centralizing forces in organizing complex adaptive systems.

    Overlapping Community Detection Using Neighborhood-Inflated Seed Expansion

    Community detection is an important task in network analysis. A community (also referred to as a cluster) is a set of cohesive vertices that have more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this paper, we propose an efficient overlapping community detection algorithm using a seed expansion approach. The key idea of our algorithm is to find good seeds, and then greedily expand these seeds based on a community metric. Within this seed expansion method, we investigate the problem of how to determine good seed nodes in a graph. In particular, we develop new seeding strategies for a personalized PageRank clustering scheme that optimizes the conductance community score. Experimental results show that our seed expansion algorithm outperforms other state-of-the-art overlapping community detection methods in terms of producing cohesive clusters and identifying ground-truth communities. We also show that our new seeding strategies are better than existing strategies, and are thus effective in finding good overlapping communities in real-world networks
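
    The seed-expansion idea described here (diffuse from a seed with personalized PageRank, then keep the prefix of the ranking with the best conductance) can be sketched as follows. This is a minimal illustration only: the function names, the dense power iteration, and the parameter values are assumptions made here, and the paper's seeding strategies and neighborhood inflation are not reproduced.

```python
def personalized_pagerank(adj, seeds, alpha=0.15, iters=100):
    """Power iteration for PageRank personalized to `seeds`.

    adj maps each node to a list of neighbors (undirected, no isolated nodes).
    alpha is the probability of teleporting back to the seed set.
    """
    seed_mass = {v: 1.0 / len(seeds) for v in seeds}
    p = dict(seed_mass)
    for _ in range(iters):
        nxt = {v: alpha * m for v, m in seed_mass.items()}
        for u, mass in p.items():
            share = (1.0 - alpha) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        p = nxt
    return p


def sweep_cut(adj, scores):
    """Return the prefix of the degree-normalized ranking with lowest conductance."""
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    order = sorted(scores, key=lambda v: scores[v] / len(adj[v]), reverse=True)
    in_set, vol, cut = set(), 0, 0
    best_set, best_phi = set(), float("inf")
    for v in order:
        inside = sum(1 for w in adj[v] if w in in_set)
        in_set.add(v)
        vol += len(adj[v])
        # Edges from v into the set stop being cut edges; the rest become cut edges.
        cut += len(adj[v]) - 2 * inside
        denom = min(vol, total_vol - vol)
        if denom == 0:
            break
        phi = cut / denom
        if phi < best_phi:
            best_phi, best_set = phi, set(in_set)
    return best_set, best_phi


# Two triangles joined by the single edge 2-3, seeded at node 0.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community, conductance = sweep_cut(graph, personalized_pagerank(graph, [0]))
```

    Conductance here is the number of edges leaving the set divided by the smaller of the two volumes (sums of degrees), so a low value indicates a cohesive set with few outgoing edges.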

    A Short Introduction to Local Graph Clustering Methods and Software

    Graph clustering has many important applications in computing, but due to the increasing sizes of graphs, even traditionally fast clustering methods can be computationally expensive for real-world graphs of interest. Scalability problems led to the development of local graph clustering algorithms that come with a variety of theoretical guarantees. Rather than return a global clustering of the entire graph, local clustering algorithms return a single cluster around a given seed node or set of seed nodes. These algorithms improve scalability because they use time and memory resources that depend only on the size of the cluster returned, instead of the size of the input graph. Indeed, for many of them, their running time grows linearly with the size of the output. In addition to scalability arguments, local graph clustering algorithms have proven to be very useful for identifying and interpreting small-scale and meso-scale structure in large-scale graphs. As opposed to heuristic operational procedures, this class of algorithms comes with strong algorithmic and statistical theory. These include statistical guarantees that prove they have implicit regularization properties. One of the challenges with the existing literature on these approaches is that they are published in a wide variety of areas, including theoretical computer science, statistics, data science, and mathematics. This has made it difficult to relate the various algorithms and ideas together into a cohesive whole. We have recently been working on unifying these diverse perspectives through the lens of optimization as well as providing software to perform these computations in a cohesive fashion. In this note, we provide a brief introduction to local graph clustering, we provide some representative examples of our perspective, and we introduce our software named Local Graph Clustering (LGC).
    Comment: 3 pages, 2 figures
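
    To make the locality claim concrete, the sketch below implements one well-known member of this family, the push procedure of Andersen, Chung and Lang for approximate personalized PageRank with a lazy random walk. The abstract does not single out this algorithm, and the code is not the API of the LGC package; the function name and parameters are assumptions chosen for illustration.

```python
from collections import deque


def approximate_ppr(adj, seed, alpha=0.15, eps=1e-4):
    """Push-based approximate personalized PageRank (lazy random walk).

    adj maps each node to a list of neighbors (undirected, no self-loops).
    Returns (p, r): the approximation p and the leftover residual r.
    On exit every node v satisfies r[v] < eps * degree(v).
    """
    p = {}
    r = {seed: 1.0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        deg_u = len(adj[u])
        res = r.get(u, 0.0)
        if res < eps * deg_u:
            continue
        # Push: keep an alpha fraction, hold half of the rest at u,
        # and spread the other half evenly over u's neighbors.
        p[u] = p.get(u, 0.0) + alpha * res
        r[u] = (1.0 - alpha) * res / 2.0
        share = (1.0 - alpha) * res / (2.0 * deg_u)
        for v in adj[u]:
            old = r.get(v, 0.0)
            r[v] = old + share
            # Enqueue v only when it newly crosses its threshold.
            if old < eps * len(adj[v]) <= r[v]:
                queue.append(v)
        if r[u] >= eps * deg_u:
            queue.append(u)
    return p, r


# A path of 1000 nodes: only nodes near the seed are ever touched.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 1000] for i in range(1000)}
p, r = approximate_ppr(path, seed=0, eps=1e-3)
```

    Each push moves at least `alpha * eps * degree(u)` of probability mass into `p`, and the total mass is 1, so the work is bounded by a quantity of order `1 / (alpha * eps)` regardless of the number of nodes in the graph.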

    Parallel Local Graph Clustering

    Graph clustering has many important applications in computing, but due to growing sizes of graphs, even traditionally fast clustering methods such as spectral partitioning can be computationally expensive for real-world graphs of interest. Motivated partly by this, so-called local algorithms for graph clustering have received significant interest due to the fact that they can find good clusters in a graph with work proportional to the size of the cluster rather than that of the entire graph. This feature has proven to be crucial in making such graph clustering and many of its downstream applications efficient in practice. While local clustering algorithms are already faster than traditional algorithms that touch the entire graph, they are sequential and there is an opportunity to make them even more efficient via parallelization. In this paper, we show how to parallelize many of these algorithms in the shared-memory multicore setting, and we analyze the parallel complexity of these algorithms. We present comprehensive experiments on large-scale graphs showing that our parallel algorithms achieve good parallel speedups on a modern multicore machine, thus significantly speeding up the analysis of local graph clusters in the very large-scale setting.
    Comment: Fixed typo in Figure

    Overlapping Community Detection via Local Spectral Clustering

    Large graphs arise in a number of contexts and understanding their structure and extracting information from them is an important research area. Early algorithms for mining communities focused on the global structure, and often run in time that depends on the size of the entire graph. Nowadays, as we often explore networks with billions of vertices and find communities with hundreds of members, it is crucial to shift our attention from macroscopic structure to microscopic structure in large networks. A growing body of work has been adopting local expansion methods in order to identify the community members from a few exemplary seed members. In this paper, we propose a novel approach for finding overlapping communities called LEMON (Local Expansion via Minimum One Norm). The algorithm finds the community by seeking a sparse vector in the span of the local spectra such that the seeds are in its support. We show that LEMON can achieve the highest detection accuracy among state-of-the-art proposals. The running time depends on the size of the community rather than that of the entire graph. The algorithm is easy to implement, and is highly parallelizable. We further provide theoretical analysis on the local spectral properties, bounding the measure of tightness of the extracted community in terms of the eigenvalues of the graph Laplacian. Moreover, given that networks are not all similar in nature, a comprehensive analysis on how the local expansion approach is suited for uncovering communities in different networks is still lacking. We thoroughly evaluate our approach using both synthetic and real-world datasets across different domains, and analyze the empirical variations when applying our method to inherently different networks in practice. In addition, heuristics on how the seed set quality and quantity affect the performance are provided.
    Comment: Extended version to the conference proceeding in WWW'1

    Heat kernel based community detection

    The heat kernel is a particular type of graph diffusion that, like the much-used personalized PageRank diffusion, is useful in identifying a community near a starting seed node. We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces. Our algorithm is formally a relaxation method for solving a linear system to estimate the matrix exponential in a degree-weighted norm. We prove that this algorithm stays localized in a large graph and has a worst-case constant runtime that depends only on the parameters of the diffusion, not the size of the graph. Our experiments on real-world networks indicate that the communities produced by this method have better conductance than those produced by PageRank, although they take slightly longer to compute on large graphs. On a real-world community identification task, the heat kernel communities perform better than those from the PageRank diffusion.
    Comment: 10 pages, published in KDD2014 proceedings; Contains minor correction to experiments from original version
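
    For readers unfamiliar with the diffusion itself, the heat kernel vector for a seed can be written as exp(-t) * sum over k of (t^k / k!) * P^k s, where P is the random-walk transition matrix and s is the seed indicator. The sketch below evaluates a truncated version of that series directly. It is not the local relaxation algorithm the paper presents, and the function name, `t`, and `terms` are assumptions chosen for illustration.

```python
import math


def heat_kernel_diffusion(adj, seed, t=5.0, terms=30):
    """Truncated Taylor series for the heat kernel diffusion from `seed`.

    adj maps each node to a list of neighbors (undirected, no isolated nodes).
    Computes exp(-t) * sum_{k < terms} (t**k / k!) * P^k s.
    """
    walk = {seed: 1.0}        # holds P^k s for the current k
    coeff = math.exp(-t)      # holds exp(-t) * t**k / k! for the current k
    h = {}
    for k in range(terms):
        for v, mass in walk.items():
            h[v] = h.get(v, 0.0) + coeff * mass
        # Advance the random walk by one step.
        nxt = {}
        for u, mass in walk.items():
            share = mass / len(adj[u])
            for v in adj[u]:
                nxt[v] = nxt.get(v, 0.0) + share
        walk = nxt
        coeff *= t / (k + 1)
    return h


# Two triangles joined by the single edge 2-3, seeded at node 0.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
h = heat_kernel_diffusion(graph, seed=0)
```

    The weights t^k / k! decay much faster than the geometric weights of PageRank once k exceeds t, which is why a short truncation already captures almost all of the probability mass.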

    Leveraging local network communities to predict academic performance

    For more than 20 years, social network analysis of student collaboration networks has focused on a student's centrality to predict academic performance. Even though a growing body of sociological literature supports the idea that academic success is contagious, identifying central students in the network alone does not capture how peer interactions facilitate the spread of academic success throughout the network. Consequently, we propose novel predictors that treat academic success as a contagion by identifying a student's learning community, consisting of the peers that are most likely to influence a student's performance in a course. We evaluate the importance of these learning communities by predicting academic outcomes in an introductory college statistics course with 103 students. In particular, we observe that by including these learning community predictors, the resulting model is 68 times more likely to be the correct model than the current state-of-the-art centrality network models in the literature.
    Comment: 12 pages, 5 figures

    Inferring Fine-grained Details on User Activities and Home Location from Social Media: Detecting Drinking-While-Tweeting Patterns in Communities

    Nearly all previous work on geo-locating latent states and activities from social media confounds general discussions about activities, self-reports of users participating in those activities at times in the past or future, and self-reports made at the immediate time and place the activity occurs. Activities, such as alcohol consumption, may occur at different places and types of places, and it is important not only to detect the local regions where these activities occur, but also to analyze the degree of participation in them by local residents. In this paper, we develop new machine learning based methods for fine-grained localization of activities and home locations from Twitter data. We apply these methods to discover and compare alcohol consumption patterns in a large urban area, New York City, and a more suburban and rural area, Monroe County. We find positive correlations between the rate of alcohol consumption reported among a community's Twitter users and the density of alcohol outlets, demonstrating that the degree of correlation varies significantly between urban and suburban areas. While our experiments are focused on alcohol use, our methods for locating homes and distinguishing temporally-specific self-reports are applicable to a broad range of behaviors and latent states.
    Comment: 12 pages, 7 figures, 4-page poster version accepted at ICWSM 2016, alcohol dataset and keywords available in: cs.rochester.edu/u/nhossain/icwsm-16-data.zi

    A Local Spectral Method for Graphs: with Applications to Improving Graph Partitions and Exploring Data Graphs Locally

    The second eigenvalue of the Laplacian matrix and its associated eigenvector are fundamental features of an undirected graph, and as such they have found widespread use in scientific computing, machine learning, and data analysis. In many applications, however, graphs that arise have several local regions of interest, and the second eigenvector will typically fail to provide information fine-tuned to each local region. In this paper, we introduce a locally-biased analogue of the second eigenvector, and we demonstrate its usefulness at highlighting local properties of data graphs in a semi-supervised manner. To do so, we first view the second eigenvector as the solution to a constrained optimization problem, and we incorporate the local information as an additional constraint; we then characterize the optimal solution to this new problem and show that it can be interpreted as a generalization of a Personalized PageRank vector; and finally, as a consequence, we show that the solution can be computed in nearly-linear time. In addition, we show that this locally-biased vector can be used to compute an approximation to the best partition near an input seed set in a manner analogous to the way in which the second eigenvector of the Laplacian can be used to obtain an approximation to the best partition in the entire input graph. Such a primitive is useful for identifying and refining clusters locally, as it allows us to focus on a local region of interest in a semi-supervised manner. Finally, we provide a detailed empirical evaluation of our method by showing how it can be applied to finding locally-biased sparse cuts around an input vertex seed set in social and information networks.
    Comment: 24 pages. Completely rewritten; substance is still the same, but the presentation is reworked
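
    The construction outlined in this abstract (the second eigenvector as a constrained optimization problem, plus one extra locality constraint) can be sketched as follows. The notation is an assumption made here for illustration: $L$ is the graph Laplacian, $D$ the diagonal degree matrix, $\mathbf{1}$ the all-ones vector, $s$ a seed vector, and $\kappa$ a correlation parameter; the paper's own normalization may differ.

```latex
% Global problem: the minimizer is the second (generalized) eigenvector of L
\[
  \min_{x} \; x^{\top} L x
  \quad \text{subject to} \quad
  x^{\top} D x = 1, \qquad x^{\top} D \mathbf{1} = 0
\]
% Locally-biased variant: the same problem with one added constraint that
% forces the solution to correlate with the seed vector s
\[
  \min_{x} \; x^{\top} L x
  \quad \text{subject to} \quad
  x^{\top} D x = 1, \qquad x^{\top} D \mathbf{1} = 0, \qquad
  \left( x^{\top} D s \right)^{2} \ge \kappa
\]
```

    Setting $\kappa = 0$ makes the added constraint vacuous and recovers the global problem, while larger values of $\kappa$ pull the solution toward the seed set.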