
    Network Alignment by Discrete Ollivier-Ricci Flow

    Full text link
    In this paper, we consider the problem of approximately aligning/matching two graphs. Given two graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), the objective is to map nodes u, v ∈ G_1 to nodes u', v' ∈ G_2 such that when u, v share an edge in G_1, their corresponding nodes u', v' in G_2 are very likely connected as well. This problem, which contains subgraph isomorphism as a special case, poses extra challenges when the networks being matched exhibit the small-world phenomenon. In this work, we propose to use a 'Ricci flow metric' to define the distance between two nodes in a network. This distance is then used to define the similarity of a pair of nodes, one from each network, which is the crucial step of network alignment. Specifically, the Ricci curvature of an edge describes, intuitively, how well the local neighborhood is connected. The graph Ricci flow uniformizes discrete Ricci curvature and induces a Ricci flow metric that is insensitive to node/edge insertions and deletions. With the new metric, we can map a node in G_1 to the node in G_2 whose distance vector to a few preselected landmarks is the most similar. The robustness of the graph metric makes it outperform other methods when tested on various complex graph models and real-world network data sets (Emails, Internet, and protein interaction networks). The source code for computing the Ricci curvature and the Ricci flow metric is available at https://github.com/saibalmars/GraphRicciCurvature. Comment: Appears in the Proceedings of the 26th International Symposium on Graph Drawing and Network Visualization (GD 2018)
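
    A minimal sketch of the landmark-based matching step described above, not the authors' implementation (their code is in the linked repository): assuming each graph's edge weights already encode a Ricci-flow-style metric, every node is summarized by its vector of shortest-path distances to a few preselected landmarks, and a node of G_1 is mapped to the node of G_2 with the closest vector. The function names, the landmark choice, and the use of plain Dijkstra distances are illustrative assumptions.

        import networkx as nx
        import numpy as np

        def landmark_signature(G, landmarks, weight="weight"):
            """Distance vector from every node of G to the preselected landmarks.

            Assumes G is connected and its edge weights encode the Ricci flow metric."""
            sig = {v: np.zeros(len(landmarks)) for v in G.nodes}
            for j, lm in enumerate(landmarks):
                dist = nx.single_source_dijkstra_path_length(G, lm, weight=weight)
                for v, d in dist.items():
                    sig[v][j] = d
            return sig

        def match_nodes(G1, G2, landmarks1, landmarks2):
            """Map each node of G1 to the node of G2 with the most similar signature."""
            s1 = landmark_signature(G1, landmarks1)
            s2 = landmark_signature(G2, landmarks2)
            nodes2 = list(G2.nodes)
            M2 = np.array([s2[v] for v in nodes2])
            return {u: nodes2[int(np.argmin(np.linalg.norm(M2 - vec, axis=1)))]
                    for u, vec in s1.items()}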

    On Approximation Guarantees for Greedy Low Rank Optimization

    Full text link
    We provide new approximation guarantees for greedy low-rank matrix estimation under standard assumptions of restricted strong convexity and smoothness. Our novel analysis also uncovers previously unknown connections between low-rank estimation and combinatorial optimization, so much so that our bounds are reminiscent of corresponding approximation bounds in submodular maximization. We additionally provide statistical recovery guarantees. Finally, we present an empirical comparison of greedy estimation with established baselines on two important real-world problems.
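
    As a rough, hypothetical illustration of greedy low-rank estimation for the simplest loss (squared Frobenius error, which trivially satisfies restricted strong convexity and smoothness), the sketch below adds one rank-1 component per step from the top singular pair of the residual and then refits all coefficients; the paper's analysis covers far more general objectives, and the function name is an assumption.

        import numpy as np

        def greedy_low_rank(Y, rank):
            """Greedy rank-1 pursuit for min ||Y - X||_F^2 (illustrative loss only)."""
            atoms = []                      # rank-1 atoms u v^T chosen so far
            X = np.zeros_like(Y)
            for _ in range(rank):
                R = Y - X                   # negative gradient of 0.5 * ||Y - X||_F^2
                U, s, Vt = np.linalg.svd(R, full_matrices=False)
                atoms.append(np.outer(U[:, 0], Vt[0]))
                # "fully corrective" step: refit the weights of all atoms by least squares
                A = np.stack([a.ravel() for a in atoms], axis=1)
                coef, *_ = np.linalg.lstsq(A, Y.ravel(), rcond=None)
                X = (A @ coef).reshape(Y.shape)
            return X

        # usage: X_hat = greedy_low_rank(np.random.randn(50, 30), rank=5)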

    Graph2Seq: Scalable Learning Dynamics for Graphs

    Full text link
    Neural networks have been shown to be an effective tool for learning algorithms over graph-structured data. However, graph representation techniques, which convert graphs to real-valued vectors for use with neural networks, are still in their infancy. Recent works have proposed several approaches (e.g., graph convolutional networks), but these methods have difficulty scaling and generalizing to graphs with different sizes and shapes. We present Graph2Seq, a new technique that represents vertices of graphs as infinite time series. By not limiting the representation to a fixed dimension, Graph2Seq scales naturally to graphs of arbitrary sizes and shapes. Graph2Seq is also reversible, allowing full recovery of the graph structure from the sequences. By analyzing a formal computational model for graph representation, we show that an unbounded sequence is necessary for scalability. Our experimental results with Graph2Seq show strong generalization and new state-of-the-art performance on a variety of graph combinatorial optimization problems.
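
    The sketch below is only a toy illustration of the core idea, vertices represented by (truncated) time series generated by iterated neighborhood aggregation, and not the Graph2Seq architecture itself; the dynamics, normalization, and function name are assumptions.

        import networkx as nx
        import numpy as np

        def vertex_sequences(G, steps=8, seed=0):
            """Each vertex emits a time series obtained by repeatedly summing its
            neighbors' states (truncated to `steps` terms for illustration)."""
            rng = np.random.default_rng(seed)
            state = {v: rng.random() for v in G.nodes}
            seqs = {v: [state[v]] for v in G.nodes}
            for _ in range(steps - 1):
                state = {v: sum(state[u] for u in G.neighbors(v)) for v in G.nodes}
                norm = max(abs(x) for x in state.values()) or 1.0   # keep values bounded
                state = {v: x / norm for v, x in state.items()}
                for v in G.nodes:
                    seqs[v].append(state[v])
            return seqs

        # usage: seqs = vertex_sequences(nx.karate_club_graph())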

    Absorbing random-walk centrality: Theory and algorithms

    Full text link
    We study a new notion of graph centrality based on absorbing random walks. Given a graph G = (V, E) and a set of query nodes Q ⊆ V, we aim to identify the k most central nodes in G with respect to Q. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes Q. The goal is to find the set of k central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call k absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the k absorbing nodes in different parts of the graph so as to "intercept" random walks that start from different query nodes. Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We show that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we develop more efficient algorithms based on spectral clustering and on personalized PageRank. Comment: 11 pages, 11 figures, short paper to appear at ICDM 201
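
    A small sketch of the plain greedy scheme the abstract refers to, without any of the spectral-clustering or PageRank speedups: each step adds the candidate whose inclusion in the absorbing set most reduces the expected number of steps before walks started at the query nodes are absorbed. The exact-inversion formulation and helper names are assumptions, and it presumes a connected graph with no isolated nodes.

        import networkx as nx
        import numpy as np

        def expected_absorption(P, nodes, absorbing, query):
            """Expected walk length until absorption, averaged over the query nodes."""
            idx = {v: i for i, v in enumerate(nodes)}
            transient = [v for v in nodes if v not in absorbing]
            if not transient:
                return 0.0
            t_idx = [idx[v] for v in transient]
            Q = P[np.ix_(t_idx, t_idx)]
            t = np.linalg.solve(np.eye(len(t_idx)) - Q, np.ones(len(t_idx)))
            steps = dict(zip(transient, t))
            return float(np.mean([steps.get(q, 0.0) for q in query]))  # absorbed queries cost 0

        def greedy_awrc(G, query, k):
            """Greedily pick k absorbing nodes minimizing expected absorption length."""
            nodes = list(G.nodes)
            A = nx.to_numpy_array(G, nodelist=nodes)
            P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
            chosen = set()
            for _ in range(k):
                best = min((v for v in nodes if v not in chosen),
                           key=lambda v: expected_absorption(P, nodes, chosen | {v}, query))
                chosen.add(best)
            return chosen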

    Ranking ideas for diversity and quality

    Full text link
    When selecting ideas or trying to find inspiration, designers often must sift through hundreds or thousands of ideas. This paper provides an algorithm to rank design ideas such that the ranked list simultaneously maximizes the quality and diversity of recommended designs. To do so, we first define and compare two diversity measures using Determinantal Point Processes (DPPs) and additive submodular functions. We show that DPPs are more suitable for items expressed as text and that a greedy algorithm diversifies rankings with both theoretical guarantees and empirical performance on what is otherwise an NP-hard problem. To produce such rankings, this paper contributes a novel way to extend quality and diversity metrics from sets to permutations of ranked lists. These rank metrics open up the use of multi-objective optimization to describe trade-offs between diversity and quality in ranked lists. We use such trade-off fronts to help designers select rankings using indifference curves. However, we also show that rankings on the trade-off front share a number of top-ranked items; this means that reviewing items (to a given depth, such as the top 10) from across the entire diversity-to-quality front incurs only a marginal increase in the number of designs considered. While the proposed techniques are general purpose enough to be used across domains, we demonstrate concrete performance on selecting items in an online design community (OpenIDEO), where our approach reduces the time required to review diverse, high-quality ideas from around 25 hours to 90 minutes. This makes evaluation of crowd-generated ideas tractable for a single designer. Our code is publicly accessible for further research.
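
    The DPP-based ranking can be illustrated with the standard greedy log-determinant heuristic: append the item that most increases the log-determinant of the selected kernel submatrix, so high-quality items are preferred while near-duplicates of already-ranked items are penalized. This is a generic sketch for an assumed positive-definite kernel L, not the paper's exact procedure.

        import numpy as np

        def greedy_dpp_ranking(L, depth):
            """Greedy log-det ranking for a positive-definite DPP kernel L (n x n)."""
            n = L.shape[0]
            selected, remaining = [], list(range(n))
            for _ in range(depth):
                best, best_gain = None, -np.inf
                for i in remaining:
                    idx = selected + [i]
                    sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
                    if sign > 0 and logdet > best_gain:
                        best, best_gain = i, logdet
                selected.append(best)
                remaining.remove(best)
            return selected

        # toy kernel: L_ij = q_i * S_ij * q_j with a PSD similarity S and quality scores q
        # X = np.random.randn(20, 5); S = X @ X.T + 1e-6 * np.eye(20); q = np.random.rand(20)
        # ranking = greedy_dpp_ranking(np.outer(q, q) * S, depth=10)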

    Global and Local Structure Preserving Sparse Subspace Learning: An Iterative Approach to Unsupervised Feature Selection

    Full text link
    As we aim to alleviate the curse of high dimensionality, subspace learning is becoming more popular. Existing approaches use information about either the global or the local structure of the data; few studies focus on both simultaneously, even though both carry important information. In this paper, we propose a global and local structure preserving sparse subspace learning (GLoSS) model for unsupervised feature selection. The model realizes feature selection and subspace learning simultaneously. In addition, we develop a greedy algorithm to establish a generic combinatorial model, and an iterative strategy based on accelerated block coordinate descent is used to solve the GLoSS problem. We also provide a convergence analysis of the whole iterate sequence of the proposed iterative algorithm. Extensive experiments are conducted on real-world datasets to show the superiority of the proposed approach over several state-of-the-art unsupervised feature selection approaches. Comment: 32 pages, 6 figures and 60 references
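
    GLoSS couples sparse subspace learning with both global and local structure; as a loose, unofficial illustration of the local-structure-preserving ingredient alone, the sketch below scores features Laplacian-score style by how smoothly they vary over a k-nearest-neighbor graph of the data. All names and parameters here are assumptions, and this is not the GLoSS algorithm itself.

        import numpy as np
        from sklearn.neighbors import kneighbors_graph

        def laplacian_scores(X, n_neighbors=5):
            """Lower score = feature varies smoothly over the kNN graph of the samples."""
            W = np.asarray(kneighbors_graph(X, n_neighbors, mode="connectivity",
                                            include_self=False).todense())
            W = np.maximum(W, W.T)                       # symmetrize the adjacency
            d = W.sum(axis=1)
            L = np.diag(d) - W                           # unnormalized graph Laplacian
            scores = []
            for j in range(X.shape[1]):
                f = X[:, j] - (X[:, j] @ d) / d.sum()    # remove degree-weighted mean
                denom = f @ (d * f)
                scores.append((f @ L @ f) / denom if denom > 0 else np.inf)
            return np.array(scores)

        # usage: keep the features with the smallest scores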

    Greedy Column Subset Selection for Large-scale Data Sets

    Full text link
    In today's information systems, the availability of massive amounts of data necessitates the development of fast and accurate algorithms to summarize these data and represent them in a succinct format. One crucial problem in big data analytics is the selection of representative instances from large and massively distributed data, which is formally known as the Column Subset Selection (CSS) problem. The solution to this problem enables data analysts to gain insight into the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. This paper presents a fast and accurate greedy algorithm for large-scale column subset selection. The algorithm minimizes an objective function that measures the reconstruction error of the data matrix based on the subset of selected columns. The paper first presents a centralized greedy algorithm for column subset selection which depends on a novel recursive formula for calculating the reconstruction error of the data matrix. The paper then presents a MapReduce algorithm which selects a few representative columns from a matrix whose columns are massively distributed across several commodity machines. The algorithm first learns a concise representation of all columns using random projection, and then solves a generalized column subset selection problem at each machine, in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets. Comment: Under consideration for publication in Knowledge and Information Systems
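
    A brute-force, single-machine sketch of the greedy objective (each step adds the column whose inclusion most reduces the Frobenius reconstruction error of projecting the matrix onto the selected columns); the paper's recursive error formula and MapReduce variant avoid the repeated least-squares projections done here, and the function name is illustrative.

        import numpy as np

        def greedy_css(A, k):
            """Greedily pick k column indices of A to minimize ||A - P_S A||_F."""
            selected = []
            for _ in range(k):
                best, best_err = None, np.inf
                for j in range(A.shape[1]):
                    if j in selected:
                        continue
                    C = A[:, selected + [j]]
                    X, *_ = np.linalg.lstsq(C, A, rcond=None)   # project A onto span(C)
                    err = np.linalg.norm(A - C @ X)
                    if err < best_err:
                        best, best_err = j, err
                selected.append(best)
            return selected

        # usage: cols = greedy_css(np.random.randn(100, 40), k=5)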

    Self-Expressive Decompositions for Matrix Approximation and Clustering

    Full text link
    Data-aware methods for dimensionality reduction and matrix decomposition aim to find low-dimensional structure in a collection of data. Classical approaches discover such structure by learning a basis that can efficiently express the collection. Recently, "self expression", the idea of using a small subset of data vectors to represent the full collection, has been developed as an alternative to learning. Here, we introduce a scalable method for computing sparse SElf-Expressive Decompositions (SEED). SEED is a greedy method that constructs a basis by sequentially selecting incoherent vectors from the dataset. After forming a basis from a subset of vectors in the dataset, SEED then computes a sparse representation of the dataset with respect to this basis. We develop sufficient conditions under which SEED exactly represents low-rank matrices and vectors sampled from a union of independent subspaces. We show how SEED can be used in applications ranging from matrix approximation and denoising to clustering, and apply it to numerous real-world datasets. Our results demonstrate that SEED is an attractive low-complexity alternative to other sparse matrix factorization approaches such as sparse PCA and self-expressive methods for clustering. Comment: 11 pages, 7 figures
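
    A rough sketch of the two-stage idea only, not SEED as published: greedily keep the column that is least explained by the span of the columns kept so far (a crude incoherence proxy), then sparsely code every data vector in that basis with orthogonal matching pursuit. The selection rule, sparsity level, and names are assumptions.

        import numpy as np
        from sklearn.linear_model import OrthogonalMatchingPursuit

        def seed_like_decomposition(X, n_atoms, sparsity):
            """Greedy basis selection by residual norm, then sparse coding: X ~ D @ codes."""
            selected, residual = [], X.copy()
            for _ in range(n_atoms):
                j = int(np.argmax(np.linalg.norm(residual, axis=0)))  # least-explained column
                selected.append(j)
                D = X[:, selected]
                coef, *_ = np.linalg.lstsq(D, X, rcond=None)
                residual = X - D @ coef                  # remove the part lying in span(D)
            D = X[:, selected]
            omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
            codes = np.zeros((n_atoms, X.shape[1]))
            for j in range(X.shape[1]):
                codes[:, j] = omp.fit(D, X[:, j]).coef_
            return selected, codes

        # usage: idx, C = seed_like_decomposition(np.random.randn(64, 200), n_atoms=10, sparsity=3)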

    On landmark selection and sampling in high-dimensional data analysis

    Full text link
    In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams. Comment: 18 pages, 6 figures, submitted for publication
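
    A minimal Nyström sketch with uniformly sampled landmarks and an RBF kernel (both are assumptions, and the paper's bounds concern smarter landmark selection): the leading eigenpairs of the full kernel matrix are approximated using only its columns at the landmark points.

        import numpy as np

        def rbf(A, B, gamma=0.5):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)

        def nystrom(X, n_landmarks, n_components, seed=0):
            """Approximate the top eigenvectors/eigenvalues of the kernel matrix K(X, X)."""
            rng = np.random.default_rng(seed)
            idx = rng.choice(len(X), size=n_landmarks, replace=False)  # uniform landmarks
            C = rbf(X, X[idx])                  # kernel between all points and landmarks
            W = C[idx]                          # kernel among the landmarks themselves
            evals, evecs = np.linalg.eigh(W)
            evals = evals[::-1][:n_components]
            evecs = evecs[:, ::-1][:, :n_components]
            # Nystrom extension of the landmark eigenvectors to all points
            U = np.sqrt(n_landmarks / len(X)) * (C @ evecs) / np.clip(evals, 1e-12, None)
            lam = evals * len(X) / n_landmarks  # rescaled eigenvalue estimates
            return U, lam

        # usage: U, lam = nystrom(np.random.randn(1000, 10), n_landmarks=50, n_components=5)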

    Learning Generative Models of Similarity Matrices

    Full text link
    We describe a probabilistic (generative) view of affinity matrices along with inference algorithms for a subclass of problems associated with data clustering. This probabilistic view is helpful in understanding different models and algorithms that are based on affinity functions of the data. In particular, we show how (greedy) inference for a specific probabilistic model is equivalent to the spectral clustering algorithm. It also provides a framework for developing new algorithms and extended models. As one case, we present new generative data clustering models that allow us to infer the underlying distance measure suitable for the clustering problem at hand. These models seem to perform well on a larger class of problems for which other clustering algorithms (including spectral clustering) usually fail. Experimental evaluation was performed on a variety of point data sets, showing excellent performance. Comment: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI 2003)
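
    For reference, the spectral clustering baseline that the paper's greedy inference is related to can be written in its standard normalized form; this is a generic sketch operating on a given affinity matrix, not the paper's generative model or inference algorithm.

        import numpy as np
        from sklearn.cluster import KMeans

        def spectral_clustering(W, n_clusters, seed=0):
            """Cluster from a symmetric, nonnegative affinity matrix W via the normalized Laplacian."""
            d = np.maximum(W.sum(axis=1), 1e-12)
            D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
            L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
            _, evecs = np.linalg.eigh(L_sym)
            U = evecs[:, :n_clusters]                    # eigenvectors of the smallest eigenvalues
            U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
            return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(U)

        # usage: W should be a symmetric affinity, e.g. an RBF kernel over the data points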