Clustering and Community Detection in Directed Networks: A Survey
Networks (or graphs) appear as dominant structures in diverse domains,
including sociology, biology, neuroscience and computer science. In most of the
aforementioned cases graphs are directed - in the sense that there is
directionality on the edges, making the semantics of the edges non symmetric.
An interesting feature that real networks present is the clustering or
community structure property, under which the graph topology is organized into
modules commonly called communities or clusters. The essence here is that nodes
of the same community are highly similar while on the contrary, nodes across
communities present low similarity. Revealing the underlying community
structure of directed complex networks has become a crucial and
interdisciplinary topic with a plethora of applications. Therefore, naturally
there is a recent wealth of research production in the area of mining directed
graphs - with clustering being the primary method and tool for community
detection and evaluation. The goal of this paper is to offer an in-depth review
of the methods presented so far for clustering directed networks along with the
relevant necessary methodological background and also related applications. The
survey commences by offering a concise review of the fundamental concepts and
methodological base on which graph clustering algorithms capitalize. Then we
present the relevant work along two orthogonal classifications. The first one
is mostly concerned with the methodological principles of the clustering
algorithms, while the second one approaches the methods from the viewpoint
regarding the properties of a good cluster in a directed network. Further, we
present methods and metrics for evaluating graph clustering results,
demonstrate interesting application domains and provide promising future
research directions. Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear)
Towards combinatorial clustering: preliminary research survey
The paper describes clustering problems from the combinatorial viewpoint. A
brief systemic survey is presented including the following: (i) basic
clustering problems (e.g., classification, clustering, sorting, clustering with
an order over cluster), (ii) basic approaches to assessment of objects and
object proximities (i.e., scales, comparison, aggregation issues), (iii) basic
approaches to evaluation of local quality characteristics for clusters and
total quality characteristics for clustering solutions, (iv) clustering as
multicriteria optimization problem, (v) generalized modular clustering
framework, (vi) basic clustering models/methods (e.g., hierarchical clustering,
k-means clustering, minimum spanning tree based clustering, clustering as
assignment, detection of clique/quasi-clique based clustering, correlation
clustering, network communities based clustering). Special attention is
targeted to formulation of clustering as multicriteria optimization models.
Combinatorial optimization models are used as auxiliary problems (e.g.,
assignment, partitioning, knapsack problem, multiple choice problem,
morphological clique problem, searching for consensus/median for structures).
Numerical examples illustrate problem formulations, solving methods, and
applications. The material can be used as follows: (a) a research survey, (b) a
foundation for designing the structure/architecture of composite modular
clustering software, (c) a bibliography reference collection, and (d) a
tutorial. Comment: 102 pages, 66 figures, 67 tables
Accelerating drug repurposing for COVID-19 via modeling drug mechanism of action with large scale gene-expression profiles
The novel coronavirus disease, named COVID-19, emerged in China in December
2019, and has rapidly spread around the world. It is clearly urgent to fight
COVID-19 at global scale. The development of methods for identifying drug uses
based on phenotypic data can improve the efficiency of drug development.
However, there are still many difficulties in identifying drug applications
based on cell picture data. This work reported one state-of-the-art machine
learning method to identify drug uses based on the cell image features of 1024
drugs generated in the LINCS program. Because the multi-dimensional image
features are affected by non-experimental factors, the characteristics of
similar drugs vary greatly, and the current number of samples is not enough
for deep learning and similar methods to be used for learning optimization. As a
consequence, this study is based on the supervised ITML algorithm to convert
the characteristics of drugs. The results show that the characteristics of ITML
conversion are more conducive to the recognition of drug functions. The
analysis of feature conversion shows that different features play important
roles in identifying different drug functions. For COVID-19, Chloroquine and
Hydroxychloroquine, which achieve antiviral effects by inhibiting endocytosis
among other mechanisms, were classified into the same community. Clomiphene, in
the same community, inhibited the entry of Ebola virus, indicating a similar MoA
that could be reflected in cell images. Comment: 22 pages
Bridging belief function theory to modern machine learning
Machine learning is a quickly evolving field which now looks really different
from what it was 15 years ago, when classification and clustering were major
issues. This document proposes several trends to explore the new questions of
modern machine learning, with the strong afterthought that the belief function
framework has a major role to play.
Survey of state-of-the-art mixed data clustering algorithms
Mixed data comprises both numeric and categorical features, and mixed
datasets occur frequently in many domains, such as health, finance, and
marketing. Clustering is often applied to mixed datasets to find structures and
to group similar objects for further analysis. However, clustering mixed data
is challenging because it is difficult to directly apply mathematical
operations, such as summation or averaging, to the feature values of these
datasets. In this paper, we present a taxonomy for the study of mixed data
clustering algorithms by identifying five major research themes. We then
present a state-of-the-art review of the research works within each research
theme. We analyze the strengths and weaknesses of these methods with pointers
for future research directions. Lastly, we present an in-depth analysis of the
overall challenges in this field, highlight open research questions and discuss
guidelines to make progress in the field. Comment: 20 Pages, 2 columns, 6 Tables, 209 References
Orthogonal symmetric non-negative matrix factorization under the stochastic block model
We present a method based on the orthogonal symmetric non-negative matrix
tri-factorization of the normalized Laplacian matrix for community detection in
complex networks. While the exact factorization of a given order may not exist
and is NP-hard to compute, we obtain an approximate factorization by solving an
optimization problem. We establish the connection of the factors obtained
through the factorization to a non-negative basis of an invariant subspace of
the estimated matrix, drawing a parallel with spectral clustering. Using such
factorization for clustering in networks is motivated by analyzing a
block-diagonal Laplacian matrix with the blocks representing the connected
components of a graph. The method is shown to be consistent for community
detection in graphs generated from the stochastic block model and the degree
corrected stochastic block model. Simulation results and real data analysis
show the effectiveness of these methods under a wide variety of situations,
including sparse and highly heterogeneous graphs where the usual spectral
clustering is known to fail. Our method also performs better than the state of
the art in popular benchmark network datasets, e.g., the political web blogs
and the karate club data. Comment: 35 pages, 3 figures
Hybrid Clustering based on Content and Connection Structure using Joint Nonnegative Matrix Factorization
We present a hybrid method for latent information discovery on data sets
containing both text content and connection structure based on constrained low
rank approximation. The new method jointly optimizes the Nonnegative Matrix
Factorization (NMF) objective function for text clustering and the Symmetric
NMF (SymNMF) objective function for graph clustering. We propose an effective
algorithm for the joint NMF objective function, based on a block coordinate
descent (BCD) framework. The proposed hybrid method discovers content
associations via latent connections found using SymNMF. The method can also be
applied with a natural conversion of the problem when a hypergraph formulation
is used or the content is associated with hypergraph edges.
Experimental results show that by simultaneously utilizing both content and
connection structure, our hybrid method produces higher quality clustering
results compared to the other NMF clustering methods that use content alone
(standard NMF) or connection structure alone (SymNMF). We also present some
interesting applications to several types of real world data such as citation
recommendations of papers. The hybrid method proposed in this paper can also be
applied to general data expressed with both feature space vectors and pairwise
similarities and can be extended to the case with multiple feature spaces or
multiple similarity measures. Comment: 9 pages, Submitted to a conference, Feb. 201
A Memetic Algorithm for the Minimum Conductance Graph Partitioning Problem
The minimum conductance problem is an NP-hard graph partitioning problem.
Apart from the search for bottlenecks in complex networks, the problem is very
closely related to the popular area of network community detection. In this
paper, we tackle the minimum conductance problem as a pseudo-Boolean
optimisation problem and propose a memetic algorithm to solve it. An efficient
local search strategy is established. Our memetic algorithm starts by using
this local search strategy with different random strings to sample a set of
diverse initial solutions. This is followed by an evolutionary phase based on a
steady-state framework and two intensification subroutines. We compare the
algorithm to a wide range of multi-start local search approaches and classical
genetic algorithms with different crossover operators. The experimental results
are presented for a diverse set of real-world networks. These results indicate
that the memetic algorithm outperforms the alternative stochastic approaches.
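The abstract above does not spell out the objective, so the following is a minimal sketch of the standard textbook definition of conductance for an undirected, unweighted graph, not the paper's memetic algorithm. The function name `conductance` and the edge-list input format are illustrative assumptions.

```python
def conductance(edges, subset):
    """Conductance of the cut (S, V \\ S) in an undirected, unweighted graph.

    Standard definition (an assumption here; the abstract does not state it):
    number of edges crossing the cut divided by the smaller of the two
    volumes, where a volume is the sum of node degrees on one side.
    """
    s = set(subset)
    cut = 0        # edges with exactly one endpoint in S
    vol_s = 0      # sum of degrees of nodes inside S
    vol_rest = 0   # sum of degrees of nodes outside S
    for u, v in edges:
        # Each edge adds one unit of degree to each of its endpoints.
        for node in (u, v):
            if node in s:
                vol_s += 1
            else:
                vol_rest += 1
        if (u in s) != (v in s):
            cut += 1
    denom = min(vol_s, vol_rest)
    if denom == 0:
        raise ValueError("one side of the cut has zero volume")
    return cut / denom


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(conductance(edges, {0, 1, 2}))  # 1 cut edge over volume 7
```

Low values mark bottlenecks: separating the two triangles cuts one edge against a volume of 7, whereas isolating node 0 alone cuts both of its edges and gives a conductance of 1.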
Community detection in network analysis: a survey
Community structures are common in networks, including in domains such as sociology, biology, and business. The characteristic of the community structure is that nodes of the same community are highly similar while, on the contrary, nodes across communities present low similarity.
In academia, there is a surge in research efforts on community detection in network analysis, especially in developing statistically sound methodologies for exploring, modeling, and interpreting these kinds of structures and relationships.
This survey paper aims to provide a brief review of current applicable
statistical methodologies and approaches in a comparative manner, along with metrics for evaluating graph clustering results and applications using R. At the
end, we provide promising future research directions. Statistic
Identification of Overlapping Communities via Constrained Egonet Tensor Decomposition
Detection of overlapping communities in real-world networks is a generally
challenging task. Upon recognizing that a network is in fact the union of its
egonets, a novel network representation using multi-way data structures is
advocated in this contribution. The introduced sparse tensor-based
representation exhibits richer structure compared to its matrix counterpart,
and thus enables a more robust approach to community detection. To leverage
this structure, a constrained tensor approximation framework is introduced
using PARAFAC decomposition. The arising constrained trilinear optimization is
handled via alternating minimization, where intermediate subproblems are solved
using the alternating direction method of multipliers (ADMM) to ensure
convergence. The factors obtained provide soft community memberships, which can
further be exploited for crisp, and possibly-overlapping community assignments.
The framework is further broadened to include time-varying graphs, where the
edgeset as well as the underlying communities evolve through time. Performance
of the proposed approach is assessed via tests on benchmark synthetic graphs as
well as real-world networks. As corroborated by numerical tests, the proposed
tensor-based representation captures multi-hop nodal connections, that is,
connectivity patterns within single-hop neighbors, whose exploitation yields a
more robust community identification in the presence of mixing as well as
overlapping communities.
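The observation that a network is the union of its egonets can be illustrated with a small sketch. It assumes the common definition of an egonet as the subgraph induced by a node and its direct neighbors. The helper name `egonets` and the edge-set representation are hypothetical, and the sketch does not attempt the tensor construction or the constrained PARAFAC step described in the abstract.

```python
from collections import defaultdict


def egonets(edges):
    """Map each node to the edge set of its egonet.

    Assumed definition: the egonet of a node is the subgraph induced by
    the node together with its one-hop neighbors. Edges are stored as
    frozensets so that (u, v) and (v, u) compare equal.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    result = {}
    for ego, nbrs in neighbors.items():
        members = nbrs | {ego}
        # Keep every edge whose two endpoints both lie in the egonet.
        result[ego] = {
            frozenset((u, v))
            for u, v in edges
            if u in members and v in members
        }
    return result


edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
nets = egonets(edges)
# Every edge belongs to the egonet of each of its endpoints, so the
# union of all egonets recovers the original edge set.
recovered = set().union(*nets.values())
print(recovered == {frozenset(e) for e in edges})
```

Stacking one adjacency matrix per egonet along a third axis would give a three-way array of the kind the abstract refers to; that step is left out here because the abstract does not give its exact form.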