    On Spectral Graph Embedding: A Non-Backtracking Perspective and Graph Approximation

    Graph embedding has been proven to be efficient and effective in facilitating graph analysis. In this paper, we present a novel spectral framework called NOn-Backtracking Embedding (NOBE), which offers a new perspective that organizes graph data at a deep level by tracking the flow traversing the edges with backtracking prohibited. Further, by analyzing the non-backtracking process, a technique called graph approximation is devised, which provides a channel to transform the spectral decomposition on an edge-to-edge matrix to that on a node-to-node matrix. Theoretical guarantees are provided by bounding the difference between the corresponding eigenvalues of the original graph and its graph approximation. Extensive experiments conducted on various real-world networks demonstrate the efficacy of our methods on both macroscopic and microscopic levels, including clustering and structural hole spanner detection. Comment: SDM 2018 (full version including all proofs)
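As a rough illustration of the "edge-to-edge matrix" mentioned in the abstract above, the sketch below builds the standard non-backtracking (Hashimoto) matrix of a small undirected graph in plain Python. The function name `non_backtracking_matrix` and the triangle example are assumptions made for illustration; the abstract does not specify NOBE's exact operator or normalization, so this should not be read as the authors' method.

```python
def non_backtracking_matrix(edges):
    """Build the standard non-backtracking (Hashimoto) matrix.

    Each undirected edge {u, v} is split into two directed edges,
    (u, v) and (v, u). Entry B[i][j] is 1 when directed edge
    i = (u, v) can be followed by directed edge j = (v, w) with
    w != u (no immediate backtracking), and 0 otherwise.
    """
    directed = []
    for u, v in edges:
        directed.append((u, v))
        directed.append((v, u))

    size = len(directed)
    B = [[0] * size for _ in range(size)]
    for i, (u, v) in enumerate(directed):
        for j, (x, w) in enumerate(directed):
            # Edge j must start where edge i ends and must not return to u.
            if x == v and w != u:
                B[i][j] = 1
    return directed, B


# Hypothetical example: a triangle on nodes 0, 1, 2.
triangle = [(0, 1), (1, 2), (2, 0)]
directed, B = non_backtracking_matrix(triangle)
row_sums = [sum(row) for row in B]
```

The matrix has one row per directed edge, so an undirected graph with m edges yields a 2m-by-2m matrix. That growth in size is what makes a reduction to a node-to-node matrix, as the abstract describes, attractive.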

    Spectral comparison of large urban graphs

    The spectrum of an axial graph is proposed as a means of comparing spaces, particularly very large and complex graphs. A number of methods have been used in recent years for comparative analysis within large sets of urban areas, either to investigate properties of specific known types of street network or to propose a taxonomy of urban morphology based on an analytical technique. In many cases, a single predefined scalar measure or a small range of them, such as metric distance, integration, control or clustering coefficient, has been used to compare the graphs. While these measures are well understood theoretically, their low dimensionality limits the range of observations that can ultimately be drawn from the data. Spectral analysis produces a high dimensional vector representing each space; metric distance between these vectors may be measured to indicate the overall difference between two spaces, or subspaces may be extracted to correspond to certain features. It is used for comparison of entire urban graphs, to determine similarities (and differences) in their overall structure. Results are shown of a comparison of 152 cities distributed around the world. The clustering of cities of similar properties in a high dimensional space is discussed. Principal and nonlinear components of the data set indicate significant correlations between the graph similarities of cities and their proximity to one another, suggesting that cultural features based on location are evident in the city form and that these can be quantified by the proposed method. Results of classification tests show that a city’s location can be estimated based purely on its form. The high dimensionality of the spectra is beneficial for data-mining applications that can draw correlations with other data sets such as land use information. It is shown how further processing by supervised learning allows the extraction of relevant features.
    A methodological comparison is also drawn with statistical studies that use a strong correlation between human genetic markers and geographical location of populations to derive detailed reconstructions of prehistoric migration. Thus, it is suggested that the method may be utilised for mapping the transfer of cultural memes by comparing cities.

    Techniques for clustering gene expression data

    Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.

    A Statistical Toolbox For Mining And Modeling Spatial Data

    Most data mining projects in spatial economics start with an evaluation of a set of attribute variables on a sample of spatial entities, looking for the existence and strength of spatial autocorrelation based on Moran’s and Geary’s coefficients. The adequacy of these coefficients is rarely challenged, even though many users seem likely to make mistakes and foster confusion when reporting on their properties. My paper begins with a critical appraisal of the classical definition and rationale of these indices. I argue that while intuitively founded, they are plagued by an inconsistency in their conception. Then, I propose a principled small change leading to corrected spatial autocorrelation coefficients, which strongly simplifies their relationship and opens the way to an augmented toolbox of statistical methods of dimension reduction and data visualization, also useful for modeling purposes. A second section presents a formal framework, adapted from recent work in statistical learning, which gives theoretical support to my definition of corrected spatial autocorrelation coefficients. More specifically, the multivariate data mining methods presented here are easily implemented in existing (free) software and make it straightforward to exploit the proposed corrections in spatial data analysis practice. From a mathematical point of view, their asymptotic behavior, already studied in a series of papers by Belkin & Niyogi, suggests that they possess robustness and a limited sensitivity to the Modifiable Areal Unit Problem (MAUP), both valuable in exploratory spatial data analysis.
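For readers unfamiliar with the two indices discussed in the abstract above, the following is a minimal sketch of the classical (uncorrected) Moran's I and Geary's C in plain Python. The corrected coefficients proposed by the paper are not given in the abstract and are not reproduced here; the function names, the weight matrix `w`, and the four-location example are assumptions made for illustration only.

```python
def morans_i(x, w):
    """Classical Moran's I for values x and spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [xi - mean for xi in x]
    total_weight = sum(sum(row) for row in w)
    # Weighted cross-products of deviations from the mean.
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * cross / variance_sum


def gearys_c(x, w):
    """Classical Geary's C for values x and spatial weight matrix w."""
    n = len(x)
    mean = sum(x) / n
    total_weight = sum(sum(row) for row in w)
    # Weighted squared differences between pairs of locations.
    sq_diff = sum(w[i][j] * (x[i] - x[j]) ** 2
                  for i in range(n) for j in range(n))
    variance_sum = sum((xi - mean) ** 2 for xi in x)
    return ((n - 1) / (2 * total_weight)) * sq_diff / variance_sum


# Hypothetical example: four locations along a line, with binary
# weights linking each location to its immediate neighbours.
values = [1.0, 2.0, 3.0, 4.0]
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
moran = morans_i(values, weights)
geary = gearys_c(values, weights)
```

In this example the values rise smoothly along the line, so Moran's I is positive and Geary's C falls below 1, both indicating positive spatial autocorrelation. The two indices use different numerators (cross-products versus squared differences) and different scaling factors, which is the kind of mismatch in their relationship the abstract alludes to.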

    Clustering cancer gene expression data: a comparative study

    Background: The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. There have been no studies thus far performing a large-scale evaluation of different clustering methods in this context. Results/Conclusion: We present the first large-scale analysis of seven different clustering methods and four proximity measures for the analysis of 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at http://algorithmics.molgen.mpg.de/Supplements/CompCancer/

    Analysis of problem representation in the application of artificial neural networks for feature classification in imagery
