Overlapping community detection in massive social networks
Massive social networks have become increasingly popular in recent years. Community detection is one of the most important techniques for the analysis of such complex networks. A community is a set of cohesive vertices that has more connections inside the set than outside. In many social and information networks, these communities naturally overlap. For instance, in a social network, each vertex in a graph corresponds to an individual who usually participates in multiple communities. In this thesis, we propose scalable overlapping community detection algorithms that effectively identify high quality overlapping communities in various real-world networks.
We first develop an efficient overlapping community detection algorithm using a seed set expansion approach. The key idea of this algorithm is to find good seeds and then greedily expand these seeds using a personalized PageRank clustering scheme. Experimental results show that our algorithm significantly outperforms other state-of-the-art overlapping community detection methods in terms of run time, cohesiveness of communities, and ground-truth accuracy.
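The expansion step can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the thesis's algorithm: seed selection is omitted, personalized PageRank is computed by plain power iteration, and the community is chosen by a conductance sweep cut, which the abstract does not specify. The function names and the toy graph are hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.85, iters=100):
    """Power iteration for PageRank with all teleport mass on `seed`.

    `adj` maps each vertex to its neighbor list (undirected, no isolated
    vertices, no self-loops). These are assumptions of this sketch.
    """
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            share = alpha * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[seed] += 1.0 - alpha  # teleport back to the seed
        p = nxt
    return p


def sweep_cut(adj, scores):
    """Return the prefix of the degree-normalized ranking with lowest conductance."""
    order = sorted(adj, key=lambda v: scores[v] / len(adj[v]), reverse=True)
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    in_set, vol, cut = set(), 0, 0
    best, best_phi = [], float("inf")
    for i, v in enumerate(order[:-1]):  # never take the whole vertex set
        in_set.add(v)
        vol += len(adj[v])
        inside = sum(1 for w in adj[v] if w in in_set)
        cut += len(adj[v]) - 2 * inside  # new cut edges minus edges now internal
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = order[: i + 1], phi
    return set(best), best_phi


# Toy graph: two triangles joined by the single edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
community, phi = sweep_cut(adj, personalized_pagerank(adj, seed=0))
```

Expanding from vertex 0 on this toy graph recovers the triangle containing the seed, since cutting the single bridge edge gives the lowest conductance.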
To develop more principled methods, we formulate the overlapping community detection problem as a non-exhaustive, overlapping graph clustering problem where clusters are allowed to overlap with each other, and some nodes are allowed to be outside of any cluster. To tackle this non-exhaustive, overlapping clustering problem, we propose a simple and intuitive objective function that captures the issues of overlap and non-exhaustiveness in a unified manner. To optimize the objective, we develop not only fast iterative algorithms but also more sophisticated algorithms using a low-rank semidefinite programming technique. Our experimental results show that the new objective and the algorithms are effective in finding ground-truth clusterings that have varied overlap and non-exhaustiveness.
We extend our non-exhaustive, overlapping clustering techniques to co-clustering where the goal is to simultaneously identify a clustering of the rows as well as the columns of a data matrix. As an example application, consider recommender systems where users have ratings on items. This can be represented by a bipartite graph where users and items are denoted by two different types of nodes, and the ratings are denoted by weighted edges between the users and the items. In this case, co-clustering would be a simultaneous clustering of users and items. We propose a new co-clustering objective function and an efficient co-clustering algorithm that is able to identify overlapping clusters as well as outliers on both types of the nodes in the bipartite graph. We show that our co-clustering algorithm is able to effectively capture the underlying co-clustering structure of the data, which results in boosting the performance of a standard one-dimensional clustering.
Finally, we study the design of parallel data-driven algorithms, which enables us to further increase the scalability of our overlapping community detection algorithms. Using PageRank as a model problem, we look at three algorithm design axes: work activation, data access pattern, and scheduling. Along these axes, we design and test a variety of PageRank implementations to investigate the impact of different design choices, finding that data-driven, push-based algorithms achieve significantly better scalability than standard PageRank implementations. The design choices affect both single-threaded performance and parallel scalability. The lessons learned from this study not only guide efficient implementations of many graph mining algorithms but also provide a framework for designing new scalable algorithms, especially for large-scale community detection.
Computer Science
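To make "data-driven" and "push-based" concrete, here is a single-threaded Python sketch. It is an assumption-laden illustration rather than any implementation from the thesis: a FIFO worklist stands in for the scheduling axis, the residual-based update and the `eps` threshold are choices of this sketch, and every vertex is assumed to have at least one out-edge. Only vertices whose residual is still significant are activated (data-driven), and each activated vertex pushes its residual to its out-neighbors instead of pulling from its in-neighbors.

```python
from collections import deque


def push_pagerank(out_edges, alpha=0.85, eps=1e-10):
    """Data-driven, push-based PageRank (unnormalized: ranks sum to about n)."""
    rank = {v: 0.0 for v in out_edges}
    residual = {v: 1.0 - alpha for v in out_edges}
    work = deque(out_edges)   # worklist of active vertices
    queued = set(out_edges)   # avoids duplicate worklist entries
    while work:
        u = work.popleft()
        queued.discard(u)
        r = residual[u]
        residual[u] = 0.0
        rank[u] += r          # absorb the residual into the rank
        share = alpha * r / len(out_edges[u])
        for v in out_edges[u]:
            residual[v] += share  # push to out-neighbors
            if residual[v] > eps and v not in queued:
                queued.add(v)     # activate v only when it has work to do
                work.append(v)
    return rank


# Hypothetical directed toy graph; every vertex has an out-edge.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
ranks = push_pagerank(graph)
```

Replacing the FIFO worklist with a priority queue keyed on residual, or switching the inner loop to a pull from in-neighbors, would be ways to explore the other design axes named above.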
Measuring Visual Complexity of Cluster-Based Visualizations
Handling visual complexity is a challenging problem in visualization owing to
the subjectiveness of its definition and the difficulty in devising
generalizable quantitative metrics. In this paper we address this challenge by
measuring the visual complexity of two common forms of cluster-based
visualizations: scatter plots and parallel coordinates. We conceptualize
visual complexity as a form of visual uncertainty, which is a measure of the
degree of difficulty for humans to interpret a visual representation correctly.
We propose an algorithm for estimating visual complexity for the aforementioned
visualizations using Allen's interval algebra. We first establish a set of
primitive 2-cluster cases in scatter plots and another set for parallel
coordinates based on symmetric isomorphism. We confirm that both are the
minimal sets and verify the correctness of their members computationally. We
score the uncertainty of each primitive case based on its topological
properties, including the existence of overlapping regions, splitting regions
and meeting points or edges. We compare a few optional scoring schemes against
a set of subjective scores by humans, and identify the one that is the most
consistent with the subjective scores. Finally, we extend the 2-cluster measure
to k-cluster measure as a general purpose estimator of visual complexity for
these two forms of cluster-based visualization.
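For readers unfamiliar with Allen's interval algebra, the Python sketch below classifies the relation between two intervals into one of the 13 basic relations. It illustrates the algebra only: summarizing a cluster's extent along one axis as an interval `(start, end)` with `start < end` is an assumption of this sketch, and the paper's primitive cases and scoring schemes are not reproduced here.

```python
def allen_relation(a, b):
    """Return the Allen relation of interval a with respect to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Remaining case: the interiors intersect but neither interval nests.
    return "overlaps" if a0 < b0 else "overlapped-by"


# Hypothetical extents of two clusters along one axis.
relation = allen_relation((0.0, 4.0), (2.0, 6.0))
```

Applying the function once per axis gives a pair of relations for two clusters in a scatter plot, which is one plausible way to enumerate 2-cluster configurations.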
A Dynamic Clustering and Resource Allocation Algorithm for Downlink CoMP Systems with Multiple Antenna UEs
Coordinated multi-point (CoMP) schemes have been widely studied in recent
years to tackle inter-cell interference. In practice, latency and
throughput constraints on the backhaul allow the organization of only small
clusters of base stations (BSs) where joint processing (JP) can be implemented.
In this work we focus on downlink CoMP-JP with multiple antenna user equipments
(UEs) and propose a novel dynamic clustering algorithm. The additional degrees
of freedom at the UE can be used to suppress the residual interference by using
an interference rejection combiner (IRC) and allow a multistream transmission.
In our proposal we first define a set of candidate clusters depending on
long-term channel conditions. Then, in each time block, we develop a resource
allocation scheme by jointly optimizing transmitter and receiver where: a)
within each candidate cluster a weighted sum rate is estimated and then b) a
set of clusters is scheduled in order to maximize the system weighted sum rate.
Numerical results show that much higher rates are achieved when UEs are
equipped with multiple antennas. Moreover, as this performance improvement is
mainly due to the IRC, the gain achieved by the proposed approach with respect
to the non-cooperative scheme decreases as the number of UE antennas
increases.
Comment: 27 pages, 8 figures
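Step b) above, choosing which candidate clusters to schedule, can be sketched in Python as a simple greedy selection. This is a hypothetical heuristic, not the paper's scheduling algorithm: it assumes each base station may serve in at most one scheduled cluster per time block, takes the per-cluster weighted sum rate estimates from step a) as given inputs, and uses made-up numbers.

```python
def schedule_clusters(candidates):
    """Greedily pick disjoint clusters in order of estimated weighted sum rate.

    `candidates` is a list of (frozenset of BS ids, estimated rate) pairs.
    The disjointness constraint is an assumption of this sketch.
    """
    chosen, used, total = [], set(), 0.0
    for bss, rate in sorted(candidates, key=lambda c: c[1], reverse=True):
        if used.isdisjoint(bss):  # no base station is scheduled twice
            chosen.append(bss)
            used |= bss
            total += rate
    return chosen, total


# Hypothetical candidate clusters and rate estimates (arbitrary units).
candidates = [
    (frozenset({1, 2}), 5.0),
    (frozenset({2, 3}), 6.0),
    (frozenset({3, 4}), 4.5),
    (frozenset({1}), 2.0),
    (frozenset({4}), 1.0),
]
chosen, total = schedule_clusters(candidates)
```

On these numbers the greedy pass schedules {2, 3}, {1} and {4} for a total of 9.0, whereas {1, 2} with {3, 4} would give 9.5. A greedy pass is therefore only a heuristic, and maximizing the system weighted sum rate exactly requires a search over compatible sets of clusters.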
Techniques for clustering gene expression data
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
Scalable and interpretable product recommendations via overlapping co-clustering
We consider the problem of generating interpretable recommendations by
identifying overlapping co-clusters of clients and products, based only on
positive or implicit feedback. Our approach is applicable to very large
datasets because it exhibits almost linear complexity in the input examples and
the number of co-clusters. We show, both on real industrial data and on
publicly available datasets, that the recommendation accuracy of our algorithm
is competitive with that of state-of-the-art matrix factorization techniques. In
addition, our technique has the advantage of offering recommendations that are
textually and visually interpretable. Finally, we examine how to implement our
technique efficiently on Graphical Processing Units (GPUs).
Comment: In IEEE International Conference on Data Engineering (ICDE) 201