    Clustering and Community Detection in Directed Networks: A Survey

    Networks (or graphs) appear as dominant structures in diverse domains, including sociology, biology, neuroscience and computer science. In most of these cases the graphs are directed, in the sense that the edges carry a direction, which makes the edge semantics non-symmetric. An interesting feature of real networks is the clustering or community structure property, under which the graph topology is organized into modules commonly called communities or clusters. The essence is that nodes of the same community are highly similar, whereas nodes across communities present low similarity. Revealing the underlying community structure of directed complex networks has become a crucial and interdisciplinary topic with a plethora of applications. Naturally, there has been a recent wealth of research in the area of mining directed graphs, with clustering being the primary method and tool for community detection and evaluation. The goal of this paper is to offer an in-depth review of the methods presented so far for clustering directed networks, along with the necessary methodological background and related applications. The survey commences with a concise review of the fundamental concepts and methodological base on which graph clustering algorithms capitalize. Then we present the relevant work along two orthogonal classifications. The first is mostly concerned with the methodological principles of the clustering algorithms, while the second approaches the methods from the viewpoint of the properties of a good cluster in a directed network. Further, we present methods and metrics for evaluating graph clustering results, demonstrate interesting application domains and provide promising future research directions. Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear)

    Towards combinatorial clustering: preliminary research survey

    The paper describes clustering problems from the combinatorial viewpoint. A brief systemic survey is presented including the following: (i) basic clustering problems (e.g., classification, clustering, sorting, clustering with an order over clusters), (ii) basic approaches to assessment of objects and object proximities (i.e., scales, comparison, aggregation issues), (iii) basic approaches to evaluation of local quality characteristics for clusters and total quality characteristics for clustering solutions, (iv) clustering as a multicriteria optimization problem, (v) a generalized modular clustering framework, (vi) basic clustering models/methods (e.g., hierarchical clustering, k-means clustering, minimum spanning tree based clustering, clustering as assignment, detection of clique/quasi-clique based clustering, correlation clustering, network communities based clustering). Special attention is given to the formulation of clustering as multicriteria optimization models. Combinatorial optimization models are used as auxiliary problems (e.g., assignment, partitioning, knapsack problem, multiple choice problem, morphological clique problem, searching for consensus/median for structures). Numerical examples illustrate problem formulations, solving methods, and applications. The material can be used as follows: (a) a research survey, (b) a foundation for designing the structure/architecture of composite modular clustering software, (c) a bibliography reference collection, and (d) a tutorial. Comment: 102 pages, 66 figures, 67 tables

    Accelerating drug repurposing for COVID-19 via modeling drug mechanism of action with large scale gene-expression profiles

    The novel coronavirus disease, named COVID-19, emerged in China in December 2019 and has rapidly spread around the world. It is clearly urgent to fight COVID-19 at a global scale. The development of methods for identifying drug uses based on phenotypic data can improve the efficiency of drug development. However, there are still many difficulties in identifying drug applications based on cell image data. This work reports a state-of-the-art machine learning method to identify drug uses based on the cell image features of 1024 drugs generated in the LINCS program. Because the multi-dimensional image features are affected by non-experimental factors, the characteristics of similar drugs vary greatly, and the current number of samples is not sufficient for deep learning, so other methods are used for learning optimization. Consequently, this study uses the supervised ITML algorithm to transform the characteristics of drugs. The results show that the ITML-transformed characteristics are more conducive to the recognition of drug functions. The analysis of the feature transformation shows that different features play important roles in identifying different drug functions. For the current COVID-19, Chloroquine and Hydroxychloroquine achieve antiviral effects by inhibiting endocytosis, etc., and were classified into the same community. Clomiphene, in the same community, inhibited the entry of Ebola virus, indicating similar MoAs that could be reflected by cell images. Comment: 22 pages

    Bridging belief function theory to modern machine learning

    Machine learning is a quickly evolving field which now looks very different from what it was 15 years ago, when classification and clustering were major issues. This document proposes several trends to explore the new questions of modern machine learning, with the strong afterthought that the belief function framework has a major role to play.

    Survey of state-of-the-art mixed data clustering algorithms

    Mixed data comprises both numeric and categorical features, and mixed datasets occur frequently in many domains, such as health, finance, and marketing. Clustering is often applied to mixed datasets to find structures and to group similar objects for further analysis. However, clustering mixed data is challenging because it is difficult to directly apply mathematical operations, such as summation or averaging, to the feature values of these datasets. In this paper, we present a taxonomy for the study of mixed data clustering algorithms by identifying five major research themes. We then present a state-of-the-art review of the research works within each research theme. We analyze the strengths and weaknesses of these methods with pointers for future research directions. Lastly, we present an in-depth analysis of the overall challenges in this field, highlight open research questions and discuss guidelines to make progress in the field. Comment: 20 Pages, 2 columns, 6 Tables, 209 References

    Orthogonal symmetric non-negative matrix factorization under the stochastic block model

    We present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP-hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing a parallel with spectral clustering. Using such a factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix with the blocks representing the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree-corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods under a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art on popular benchmark network datasets, e.g., the political web blogs and the karate club data. Comment: 35 pages, 3 figures

    Hybrid Clustering based on Content and Connection Structure using Joint Nonnegative Matrix Factorization

    We present a hybrid method for latent information discovery on data sets containing both text content and connection structure, based on constrained low rank approximation. The new method jointly optimizes the Nonnegative Matrix Factorization (NMF) objective function for text clustering and the Symmetric NMF (SymNMF) objective function for graph clustering. We propose an effective algorithm for the joint NMF objective function, based on a block coordinate descent (BCD) framework. The proposed hybrid method discovers content associations via latent connections found using SymNMF. The method can also be applied with a natural conversion of the problem when a hypergraph formulation is used or the content is associated with hypergraph edges. Experimental results show that by simultaneously utilizing both content and connection structure, our hybrid method produces higher quality clustering results compared to the other NMF clustering methods that use content alone (standard NMF) or connection structure alone (SymNMF). We also present some interesting applications to several types of real world data such as citation recommendations of papers. The hybrid method proposed in this paper can also be applied to general data expressed with both feature space vectors and pairwise similarities and can be extended to the case with multiple feature spaces or multiple similarity measures. Comment: 9 pages, Submitted to a conference, Feb. 201
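For orientation, one plausible way to write a joint objective of the kind described above is sketched below. The abstract does not state the formula, so every symbol here is an assumption: X is a nonnegative term-by-document matrix, S a symmetric nonnegative document similarity (connection) matrix, W and H the nonnegative factors, and α ≥ 0 a weight balancing the NMF term against the SymNMF term.

```latex
% Illustrative sketch only; the symbols X, S, W, H and the weight \alpha are assumed,
% not taken from the abstract.
\min_{W \ge 0,\; H \ge 0} \;
  \lVert X - W H \rVert_F^2 \;+\; \alpha \, \lVert S - H^{\top} H \rVert_F^2
```

Under this reading, the shared factor H couples the two terms: the first term fits the text content, the second fits the connection structure, and a block coordinate descent scheme would alternate between updating W and H.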

    A Memetic Algorithm for the Minimum Conductance Graph Partitioning Problem

    The minimum conductance problem is an NP-hard graph partitioning problem. Apart from the search for bottlenecks in complex networks, the problem is very closely related to the popular area of network community detection. In this paper, we tackle the minimum conductance problem as a pseudo-Boolean optimisation problem and propose a memetic algorithm to solve it. An efficient local search strategy is established. Our memetic algorithm starts by using this local search strategy with different random strings to sample a set of diverse initial solutions. This is followed by an evolutionary phase based on a steady-state framework and two intensification subroutines. We compare the algorithm to a wide range of multi-start local search approaches and classical genetic algorithms with different crossover operators. The experimental results are presented for a diverse set of real-world networks. These results indicate that the memetic algorithm outperforms the alternative stochastic approaches.
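As a point of reference for the objective named above, the conductance of a node subset S in an undirected graph is commonly defined as the number of edges leaving S divided by the smaller of the two volumes (degree sums) of S and its complement. The sketch below evaluates that quantity for a candidate partition encoded as a 0/1 string, in the spirit of a pseudo-Boolean formulation. The function name `conductance`, the adjacency-dict representation, and the toy graph are illustrative assumptions, not the paper's implementation, and no local search or memetic step is shown.

```python
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # undirected graph: node -> set of neighbours


def conductance(graph: Graph, bits: str) -> float:
    """Conductance of the cut encoded by `bits` (bits[i] == '1' puts node i in S).

    Assumes nodes are labelled 0..n-1 and the graph has no self-loops.
    """
    if len(bits) != len(graph):
        raise ValueError("bit string length must equal the number of nodes")
    in_s = [b == "1" for b in bits]
    # Count each cut edge once, from its endpoint inside S.
    cut = sum(1 for u in graph for v in graph[u] if in_s[u] and not in_s[v])
    vol_s = sum(len(graph[u]) for u in graph if in_s[u])
    vol_rest = sum(len(graph[u]) for u in graph if not in_s[u])
    smaller = min(vol_s, vol_rest)
    if smaller == 0:
        raise ValueError("both sides of the cut need at least one edge endpoint")
    return cut / smaller


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
two_triangles: Graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}

natural_cut = conductance(two_triangles, "111000")  # splits the two triangles
worse_cut = conductance(two_triangles, "100000")    # isolates node 0
```

On this toy graph the split between the two triangles has lower conductance than isolating a single node, which is the kind of bottleneck the problem looks for.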

    Identification of Overlapping Communities via Constrained Egonet Tensor Decomposition

    Detection of overlapping communities in real-world networks is a generally challenging task. Upon recognizing that a network is in fact the union of its egonets, a novel network representation using multi-way data structures is advocated in this contribution. The introduced sparse tensor-based representation exhibits richer structure compared to its matrix counterpart, and thus enables a more robust approach to community detection. To leverage this structure, a constrained tensor approximation framework is introduced using PARAFAC decomposition. The arising constrained trilinear optimization is handled via alternating minimization, where intermediate subproblems are solved using the alternating direction method of multipliers (ADMM) to ensure convergence. The factors obtained provide soft community memberships, which can further be exploited for crisp and possibly overlapping community assignments. The framework is further broadened to include time-varying graphs, where the edge set as well as the underlying communities evolve through time. Performance of the proposed approach is assessed via tests on benchmark synthetic graphs as well as real-world networks. As corroborated by numerical tests, the proposed tensor-based representation captures multi-hop nodal connections, that is, connectivity patterns within single-hop neighbors, whose exploitation yields a more robust community identification in the presence of mixing as well as overlapping communities.
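To make the representation concrete: the egonet of a node is the subgraph induced by that node and its single-hop neighbors, and stacking one egonet adjacency matrix per node gives a three-way array. The sketch below builds that array with plain Python lists and checks that the union of all egonets recovers the original adjacency. The function names, the dense list-of-lists layout, and the toy graph are assumptions for illustration; the paper's sparse tensor construction and its constrained PARAFAC step are not reproduced here.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # undirected graph on nodes 0..n-1, no self-loops
Matrix = List[List[int]]


def egonet_adjacency(graph: Graph, ego: int) -> Matrix:
    """n x n adjacency matrix of the subgraph induced by `ego` and its neighbours."""
    n = len(graph)
    members = {ego} | graph[ego]
    slice_ = [[0] * n for _ in range(n)]
    for u in members:
        for v in graph[u]:
            if v in members:  # keep only edges with both endpoints in the egonet
                slice_[u][v] = 1
    return slice_


def egonet_tensor(graph: Graph) -> List[Matrix]:
    """Stack one egonet slice per node; slice k belongs to ego node k."""
    return [egonet_adjacency(graph, ego) for ego in sorted(graph)]


def union_of_slices(tensor: List[Matrix]) -> Matrix:
    """Entry (i, j) is 1 if edge i-j appears in at least one egonet slice."""
    n = len(tensor)
    return [[int(any(tensor[k][i][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def adjacency(graph: Graph) -> Matrix:
    n = len(graph)
    return [[int(j in graph[i]) for j in range(n)] for i in range(n)]


# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4.
toy: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
tensor = egonet_tensor(toy)
```

The slice for node 0 contains the edge 1-2 between two of its neighbors, which is the extra connectivity information a plain adjacency row for node 0 does not carry.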