14,580 research outputs found

    Fuzzy methods for analysis of microarrays and networks

    Get PDF
    Bioinformatics involves analyses of biological data such as DNA sequences, microarrays and protein-protein interaction (PPI) networks. Its two main objectives are the identification of genes or proteins and the prediction of their functions. Biological data often contain uncertain and imprecise information. Fuzzy theory provides useful tools to deal with this type of information, hence has played an important role in analyses of biological data. In this thesis, we aim to develop some new fuzzy techniques and apply them on DNA microarrays and PPI networks. We will focus on three problems: (1) clustering of microarrays; (2) identification of disease-associated genes in microarrays; and (3) identification of protein complexes in PPI networks. The first part of the thesis aims to detect, by the fuzzy C-means (FCM) method, clustering structures in DNA microarrays corrupted by noise. Because of the presence of noise, some clustering structures found in random data may not have any biological significance. In this part, we propose to combine the FCM with the empirical mode decomposition (EMD) for clustering microarray data. The purpose of EMD is to reduce, preferably to remove, the effect of noise, resulting in what is known as denoised data. We call this method the fuzzy C-means method with empirical mode decomposition (FCM-EMD). We applied this method on yeast and serum microarrays, and the silhouette values are used for assessment of the quality of clustering. The results indicate that the clustering structures of denoised data are more reasonable, implying that genes have tighter association with their clusters. Furthermore we found that the estimation of the fuzzy parameter m, which is a difficult step, can be avoided to some extent by analysing denoised microarray data. The second part aims to identify disease-associated genes from DNA microarray data which are generated under different conditions, e.g., patients and normal people. We developed a type-2 fuzzy membership (FM) function for identification of diseaseassociated genes. This approach is applied to diabetes and lung cancer data, and a comparison with the original FM test was carried out. Among the ten best-ranked genes of diabetes identified by the type-2 FM test, seven genes have been confirmed as diabetes-associated genes according to gene description information in Gene Bank and the published literature. An additional gene is further identified. Among the ten best-ranked genes identified in lung cancer data, seven are confirmed that they are associated with lung cancer or its treatment. The type-2 FM-d values are significantly different, which makes the identifications more convincing than the original FM test. The third part of the thesis aims to identify protein complexes in large interaction networks. Identification of protein complexes is crucial to understand the principles of cellular organisation and to predict protein functions. In this part, we proposed a novel method which combines the fuzzy clustering method and interaction probability to identify the overlapping and non-overlapping community structures in PPI networks, then to detect protein complexes in these sub-networks. Our method is based on both the fuzzy relation model and the graph model. We applied the method on several PPI networks and compared with a popular protein complex identification method, the clique percolation method. For the same data, we detected more protein complexes. We also applied our method on two social networks. The results showed our method works well for detecting sub-networks and give a reasonable understanding of these communities

    Evidential Communities for Complex Networks

    Get PDF
    Community detection is of great importance for understand-ing graph structure in social networks. The communities in real-world networks are often overlapped, i.e. some nodes may be a member of multiple clusters. How to uncover the overlapping communities/clusters in a complex network is a general problem in data mining of network data sets. In this paper, a novel algorithm to identify overlapping communi-ties in complex networks by a combination of an evidential modularity function, a spectral mapping method and evidential c-means clustering is devised. Experimental results indicate that this detection approach can take advantage of the theory of belief functions, and preforms good both at detecting community structure and determining the appropri-ate number of clusters. Moreover, the credal partition obtained by the proposed method could give us a deeper insight into the graph structure

    Median evidential c-means algorithm and its application to community detection

    Get PDF
    Median clustering is of great value for partitioning relational data. In this paper, a new prototype-based clustering method, called Median Evidential C-Means (MECM), which is an extension of median c-means and median fuzzy c-means on the theoretical framework of belief functions is proposed. The median variant relaxes the restriction of a metric space embedding for the objects but constrains the prototypes to be in the original data set. Due to these properties, MECM could be applied to graph clustering problems. A community detection scheme for social networks based on MECM is investigated and the obtained credal partitions of graphs, which are more refined than crisp and fuzzy ones, enable us to have a better understanding of the graph structures. An initial prototype-selection scheme based on evidential semi-centrality is presented to avoid local premature convergence and an evidential modularity function is defined to choose the optimal number of communities. Finally, experiments in synthetic and real data sets illustrate the performance of MECM and show its difference to other methods

    A similarity-based community detection method with multiple prototype representation

    Get PDF
    Communities are of great importance for understanding graph structures in social networks. Some existing community detection algorithms use a single prototype to represent each group. In real applications, this may not adequately model the different types of communities and hence limits the clustering performance on social networks. To address this problem, a Similarity-based Multi-Prototype (SMP) community detection approach is proposed in this paper. In SMP, vertices in each community carry various weights to describe their degree of representativeness. This mechanism enables each community to be represented by more than one node. The centrality of nodes is used to calculate prototype weights, while similarity is utilized to guide us to partitioning the graph. Experimental results on computer generated and real-world networks clearly show that SMP performs well for detecting communities. Moreover, the method could provide richer information for the inner structure of the detected communities with the help of prototype weights compared with the existing community detection models

    Identifying Overlapping and Hierarchical Thematic Structures in Networks of Scholarly Papers: A Comparison of Three Approaches

    Get PDF
    We implemented three recently proposed approaches to the identification of overlapping and hierarchical substructures in graphs and applied the corresponding algorithms to a network of 492 information-science papers coupled via their cited sources. The thematic substructures obtained and overlaps produced by the three hierarchical cluster algorithms were compared to a content-based categorisation, which we based on the interpretation of titles and keywords. We defined sets of papers dealing with three topics located on different levels of aggregation: h-index, webometrics, and bibliometrics. We identified these topics with branches in the dendrograms produced by the three cluster algorithms and compared the overlapping topics they detected with one another and with the three pre-defined paper sets. We discuss the advantages and drawbacks of applying the three approaches to paper networks in research fields.Comment: 18 pages, 9 figure

    Modularity functions maximization with nonnegative relaxation facilitates community detection in networks

    Full text link
    We show here that the problem of maximizing a family of quantitative functions, encompassing both the modularity (Q-measure) and modularity density (D-measure), for community detection can be uniformly understood as a combinatoric optimization involving the trace of a matrix called modularity Laplacian. Instead of using traditional spectral relaxation, we apply additional nonnegative constraint into this graph clustering problem and design efficient algorithms to optimize the new objective. With the explicit nonnegative constraint, our solutions are very close to the ideal community indicator matrix and can directly assign nodes into communities. The near-orthogonal columns of the solution can be reformulated as the posterior probability of corresponding node belonging to each community. Therefore, the proposed method can be exploited to identify the fuzzy or overlapping communities and thus facilitates the understanding of the intrinsic structure of networks. Experimental results show that our new algorithm consistently, sometimes significantly, outperforms the traditional spectral relaxation approaches
    • …
    corecore