5,566 research outputs found

    Clustering with shallow trees

    We propose a new method for hierarchical clustering based on the optimisation of a cost function over trees of limited depth, and we derive a message-passing method that allows it to be solved efficiently. The method and algorithm can be interpreted as a natural interpolation between two well-known approaches, namely single linkage and the recently presented Affinity Propagation. Within this general scheme we analyze three structured biological/medical datasets (human populations based on genetic information, proteins based on sequences, and verbal autopsies) and show that the interpolation technique provides new insight. Comment: 11 pages, 7 figures
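
    The message-passing algorithm itself is not spelled out above, but the two endpoints it is said to interpolate between are standard. Below is a minimal sketch of those endpoints only, assuming toy 2-D data and the SciPy/scikit-learn implementations; it is not the paper's method.

```python
# Hedged sketch: the two clustering endpoints the abstract says the method
# interpolates between. This is NOT the paper's message-passing algorithm.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))   # toy points standing in for the real datasets

# Endpoint 1: single linkage (hierarchical clustering, unbounded tree depth).
Z = linkage(X, method="single")
single_labels = fcluster(Z, t=3, criterion="maxclust")

# Endpoint 2: Affinity Propagation (depth-one trees: each point attaches to an exemplar).
ap = AffinityPropagation(random_state=0).fit(X)
ap_labels = ap.labels_

print(len(set(single_labels)), len(set(ap_labels)))
```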

    Structure fusion based on graph convolutional networks for semi-supervised classification

    Faced with the diversity and complexity of multi-view data in semi-supervised classification, most existing graph convolutional networks focus on network architecture construction or on preserving the salient graph structure, and ignore the contribution of the complete graph structure to semi-supervised classification. To mine a more complete distribution structure from multi-view data while accounting for both specificity and commonality, we propose structure fusion based on graph convolutional networks (SF-GCN) to improve the performance of semi-supervised classification. SF-GCN not only retains the specific characteristics of each view's data through spectral embedding, but also captures the common structure of multi-view data through a distance metric between multi-graph structures. Assuming a linear relationship between multi-graph structures, we construct the optimization function of the structure fusion model by balancing a specificity loss against a commonality loss. By solving this function, we simultaneously obtain the fused spectral embedding of the multi-view data and the fused structure, which is fed as an adjacency matrix into graph convolutional networks for semi-supervised classification. Experiments demonstrate that SF-GCN outperforms state-of-the-art methods on three challenging citation-network datasets: Cora, Citeseer and Pubmed
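
    A minimal sketch of the linear structure-fusion idea under stated assumptions: toy adjacency matrices, arbitrary fusion weights, and a single GCN-style propagation step. The normalization, weights, and propagation are illustrative stand-ins, not the authors' exact SF-GCN formulation.

```python
# Hedged sketch: linearly fuse per-view graph structures, then use the fused
# adjacency for one GCN-style propagation step (illustrative only).
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, as used by standard GCNs."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def fuse_structures(adjs, weights):
    """Linear fusion of multi-view adjacency matrices; the specificity/commonality
    trade-off is assumed to be encoded in the (here hand-picked) weights."""
    weights = np.asarray(weights) / np.sum(weights)
    return sum(w * A for w, A in zip(weights, adjs))

rng = np.random.default_rng(0)
# Toy two-view graphs over 4 nodes, plus random node features and weights.
A1 = np.array([[0,1,0,0],[1,0,1,0],[0,1,0,1],[0,0,1,0]], dtype=float)
A2 = np.array([[0,1,1,0],[1,0,0,0],[1,0,0,1],[0,0,1,0]], dtype=float)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))

A_fused = normalize_adj(fuse_structures([A1, A2], weights=[0.6, 0.4]))
H = np.maximum(A_fused @ X @ W, 0.0)   # one ReLU(GCN) propagation step
print(H.shape)
```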

    A Dimensionality Reduction-Based Multi-Step Clustering Method for Robust Vessel Trajectory Analysis

    The shipboard Automatic Identification System (AIS) is crucial for navigation safety and maritime surveillance, and data mining and pattern analysis of AIS information have attracted considerable attention in terms of both basic research and practical applications. Clustering of spatio-temporal AIS trajectories can be used to identify abnormal patterns and to mine customary route data for transportation safety, thereby enhancing both navigation safety and maritime traffic monitoring. However, trajectory clustering is often sensitive to undesirable outliers and is essentially more complex than traditional point clustering. To overcome this limitation, a multi-step trajectory clustering method is proposed in this paper for robust AIS trajectory clustering. In the first step, Dynamic Time Warping (DTW), a similarity measurement method, is introduced to measure the distances between different trajectories. In the second step, the calculated distances, which are inversely proportional to the similarities, form a distance matrix. Furthermore, Principal Component Analysis (PCA), a widely used dimensionality reduction method, is exploited to decompose the obtained distance matrix: the top k principal components with an accumulative contribution rate above 95% are extracted, which determines the number of centers k, and the k centers are then found by an improved automatic center selection algorithm. In the last step, the improved center clustering algorithm with k clusters is applied to the distance matrix to obtain the final AIS trajectory clustering results. To improve the accuracy of the proposed multi-step clustering algorithm, an automatic algorithm for choosing the k clusters is developed based on the similarity distance. Numerous experiments on realistic AIS trajectory datasets in a bridge-area waterway and on the Mississippi River have been carried out to compare the proposed method with traditional spectral clustering and fast affinity propagation clustering. Experimental results illustrate its superior performance in terms of both quantitative and qualitative evaluation
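
    A minimal sketch of the multi-step pipeline under stated assumptions: a plain DTW implementation, scikit-learn PCA with the 95% cumulative-contribution rule to pick k, and KMeans as a stand-in for the paper's improved center selection/clustering algorithm, which is not detailed in the abstract.

```python
# Hedged sketch: DTW distances -> distance matrix -> PCA picks k (95% rule)
# -> cluster with k clusters. KMeans is a stand-in, not the paper's algorithm.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def dtw_distance(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping between two 2-D trajectories."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
# Toy variable-length 2-D tracks standing in for real AIS trajectories.
trajs = [np.cumsum(rng.normal(size=(rng.integers(20, 40), 2)), axis=0) for _ in range(12)]

# Steps 1-2: pairwise DTW distance matrix.
n = len(trajs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(trajs[i], trajs[j])

# Step 3: PCA on the distance matrix; k = #components reaching 95% cumulative variance.
pca = PCA().fit(dist)
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1)

# Step 4: cluster into k clusters (stand-in for the improved center clustering).
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pca.transform(dist)[:, :k])
print(k, labels)
```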

    Flow-based Influence Graph Visual Summarization

    Visually mining a large influence graph is appealing yet challenging. People are amazed by pictures of newscasting graphs on Twitter and engaged by hidden citation networks in academia, yet they are often troubled by the poor readability of the underlying visualization. Existing summarization methods enhance the graph visualization with blocked views, but have an adverse effect on the latent influence structure. How can we visually summarize a large graph to maximize influence flows? In particular, how can we illustrate the impact of an individual node through the summarization? Can we maintain the appealing graph metaphor while preserving both the overall influence pattern and fine readability? To answer these questions, we first formally define the influence graph summarization problem. Second, we propose an end-to-end framework to solve the new problem. Our method can not only highlight the flow-based influence patterns in the visual summarization, but also inherently support rich graph attributes. Last, we present a theoretical analysis and report our experimental results; both demonstrate that our framework effectively approximates the proposed influence graph summarization objective while outperforming previous methods in a typical scenario of visually mining academic citation networks. Comment: to appear in IEEE International Conference on Data Mining (ICDM), Shenzhen, China, December 201
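
    The abstract does not give the flow-based objective or the framework itself, but the general idea of collapsing an influence graph into a small summary whose edges aggregate flows can be sketched generically. The community grouping and edge counting below are illustrative assumptions, not the authors' method.

```python
# Hedged sketch: group nodes of a directed influence/citation graph and
# aggregate cross-group edges into "flow" weights on a summary graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy directed citation-style graph.
G = nx.gnp_random_graph(40, 0.08, seed=0, directed=True)

# Group nodes (modularity communities on the undirected view, as a stand-in
# for whatever grouping a flow-based summarization objective would choose).
groups = list(greedy_modularity_communities(G.to_undirected()))
node_to_group = {v: gi for gi, grp in enumerate(groups) for v in grp}

# Summary graph: one super-node per group, edge weight = number of original
# edges flowing from one group to another.
S = nx.DiGraph()
S.add_nodes_from(range(len(groups)))
for u, v in G.edges():
    gu, gv = node_to_group[u], node_to_group[v]
    if gu != gv:
        w = S[gu][gv]["weight"] + 1 if S.has_edge(gu, gv) else 1
        S.add_edge(gu, gv, weight=w)

print(S.number_of_nodes(), S.number_of_edges())
```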

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues
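
    As a generic illustration of two of the challenges listed above (missing data and class imbalance), here is a minimal scikit-learn pipeline on toy data; it is not a method from the review, and the feature/label construction is purely synthetic.

```python
# Hedged sketch: impute missing entries and weight classes on an imbalanced,
# partially missing toy dataset standing in for integrated multi-omics features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))             # stand-in "multi-omics" feature matrix
X[rng.random(X.shape) < 0.1] = np.nan      # ~10% missing entries
y = (rng.random(200) < 0.15).astype(int)   # imbalanced labels (~15% positives)

clf = make_pipeline(
    SimpleImputer(strategy="median"),                              # missing data
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),   # class imbalance
)
clf.fit(X, y)
print(clf.score(X, y))
```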