6,406 research outputs found

    FS^3: A Sampling based method for top-k Frequent Subgraph Mining

    Mining labeled subgraphs is a popular research task in data mining because of its potential applications in many scientific domains. All existing methods for this task explicitly or implicitly solve the subgraph isomorphism problem, which is computationally expensive, so they scale poorly when the graphs in the input database are large. In this work, we propose FS^3, a sampling-based method. It mines a small collection of subgraphs that are most frequent in the probabilistic sense. FS^3 performs Markov Chain Monte Carlo (MCMC) sampling over the space of fixed-size subgraphs such that potentially frequent subgraphs are sampled more often. In addition, FS^3 is equipped with an innovative queue manager, which stores the sampled subgraphs in a finite queue over the course of mining in such a manner that the top-k positions in the queue contain the most frequent subgraphs. Our experiments on databases of large graphs show that FS^3 is efficient and that it obtains subgraphs that are the most frequent among the subgraphs of a given size.
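The general idea of pairing an MCMC walk with a bounded queue can be sketched in a few lines. This is a toy illustration only, not the FS^3 algorithm: the states here are integers on a path rather than subgraphs, and `sample_top_k`, `path_neighbors`, the `score` function, and the eviction rule are all hypothetical stand-ins chosen for brevity.

```python
import heapq
import random


def sample_top_k(start, neighbors, score, k, capacity, steps, seed=0):
    """Metropolis-Hastings walk that keeps the best-scoring states seen.

    `neighbors(s)` returns a non-empty list of states adjacent to `s`
    (the relation is assumed symmetric); `score(s)` returns a positive
    weight, so higher-scoring states are visited more often.
    """
    rng = random.Random(seed)
    current = start
    queue = {current: score(current)}  # bounded store: state -> score
    for _ in range(steps):
        cur_nbrs = neighbors(current)
        proposal = rng.choice(cur_nbrs)
        prop_nbrs = neighbors(proposal)
        # Target ratio times the Hastings correction for unequal
        # neighborhood sizes (uniform proposal over neighbors).
        ratio = (score(proposal) * len(cur_nbrs)) / (score(current) * len(prop_nbrs))
        if rng.random() < min(1.0, ratio):
            current = proposal
        queue[current] = score(current)
        if len(queue) > capacity:
            # Evict the lowest-scoring entry so the queue stays finite.
            worst = min(queue, key=queue.get)
            del queue[worst]
    return heapq.nlargest(k, queue.items(), key=lambda item: item[1])


def path_neighbors(i, n=20):
    """Toy state space (assumption): integers 0..n-1 arranged on a path."""
    return [j for j in (i - 1, i + 1) if 0 <= j < n]


top = sample_top_k(
    start=0,
    neighbors=path_neighbors,
    score=lambda i: i + 1,  # toy stand-in for a frequency estimate
    k=3,
    capacity=8,
    steps=5000,
)
print(top)
```

In a real subgraph miner the score would be an estimate of support and the neighbor relation would connect subgraphs that differ by a small edit; both are outside the scope of this sketch.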

    A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem

    In this paper, we consider the sparse eigenvalue problem, wherein the goal is to obtain a sparse solution to the generalized eigenvalue problem. We achieve this by constraining the cardinality of the solution to the generalized eigenvalue problem and obtain sparse principal component analysis (PCA), sparse canonical correlation analysis (CCA) and sparse Fisher discriminant analysis (FDA) as special cases. Unlike the ℓ1-norm approximation to the cardinality constraint, which previous methods have used in the context of sparse PCA, we propose a tighter approximation that is related to the negative log-likelihood of a Student's t-distribution. The problem is then framed as a d.c. (difference of convex functions) program and is solved as a sequence of convex programs by invoking the majorization-minimization method. The resulting algorithm is proved to exhibit global convergence behavior, i.e., for any random initialization, the sequence (subsequence) of iterates generated by the algorithm converges to a stationary point of the d.c. program. The performance of the algorithm is empirically demonstrated on both sparse PCA (finding a few relevant genes that explain as much variance as possible in a high-dimensional gene dataset) and sparse CCA (cross-language document retrieval and vocabulary selection for music retrieval) applications.
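The structure described in this abstract can be written out roughly as follows. The symbols $A$, $B$, $k$, $\varepsilon$, $g$ and $h$ are notation assumed here for illustration, and the logarithmic surrogate is one well-known approximation of this kind, not necessarily the exact form used in the paper. The limit identity and the linearization step are standard facts about cardinality and d.c. programs.

```latex
% Cardinality-constrained generalized eigenvalue problem (assumed notation)
\max_{x}\; x^\top A x
\quad \text{s.t.} \quad x^\top B x \le 1, \qquad \|x\|_0 \le k

% A log-based surrogate that recovers the cardinality in the limit
\|x\|_0 \;=\; \lim_{\varepsilon \to 0}\;
\sum_{i=1}^{n} \frac{\log\!\left(1 + |x_i|/\varepsilon\right)}{\log\!\left(1 + 1/\varepsilon\right)}

% Generic d.c. program and its majorization-minimization iteration:
% the convex function h is replaced by its linearization at x^{(l)}
\min_{x}\; g(x) - h(x), \qquad g,\, h \text{ convex}

x^{(l+1)} \;\in\; \arg\min_{x}\; g(x) - \nabla h\!\left(x^{(l)}\right)^{\!\top} x
```

Each iteration solves a convex problem, and because the linearization of $h$ lies below $h$, the objective $g - h$ does not increase from one iterate to the next.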

    A Comparison of Relaxations of Multiset Cannonical Correlation Analysis and Applications

    Canonical correlation analysis is a statistical technique that is used to find relations between two sets of variables. An important extension in pattern analysis is to consider more than two sets of variables. This problem can be expressed as a quadratically constrained quadratic program (QCQP), commonly referred to as Multi-set Canonical Correlation Analysis (MCCA). This is a non-convex problem, so greedy algorithms converge to local optima without any guarantees on global optimality. In this paper, we show that despite being highly structured, finding the optimal solution is NP-hard. This motivates our relaxation of the QCQP to a semidefinite program (SDP). The SDP is convex, can be solved reasonably efficiently, and comes with both absolute and output-sensitive approximation guarantees. In addition to theoretical guarantees, we do an extensive comparison of the QCQP method and the SDP relaxation on a variety of synthetic and real-world data. Finally, we present two useful extensions: incorporating kernel methods and computing multiple sets of canonical vectors.
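A rough sketch of the QCQP and its SDP relaxation is given below. It assumes one common sum-of-correlations formulation of MCCA, which may differ in detail from the one used in the paper; the symbols $C_{ij}$ (cross-covariance blocks), $w_i$, $A$, $B_i$ and $W$ are notation introduced here for illustration.

```latex
% QCQP over m sets of variables (assumed sum-of-correlations form)
\max_{w_1, \dots, w_m}\; \sum_{i \ne j} w_i^\top C_{ij}\, w_j
\quad \text{s.t.} \quad w_i^\top C_{ii}\, w_i = 1, \qquad i = 1, \dots, m

% Stacked form: w = (w_1; \dots; w_m), A holds the off-diagonal blocks C_{ij},
% and B_i holds C_{ii} in its i-th diagonal block and zeros elsewhere
\max_{w}\; w^\top A\, w
\quad \text{s.t.} \quad w^\top B_i\, w = 1, \qquad i = 1, \dots, m

% SDP relaxation: substitute W = w w^\top and drop the constraint rank(W) = 1
\max_{W \succeq 0}\; \operatorname{Tr}(A W)
\quad \text{s.t.} \quad \operatorname{Tr}(B_i W) = 1, \qquad i = 1, \dots, m
```

Since every feasible $w$ of the QCQP gives a feasible $W = w w^\top$ with the same objective value, the SDP optimum is an upper bound on the QCQP optimum.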