
    Efficient Mining of Heterogeneous Star-Structured Data

    Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves the simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph-theoretical framework for addressing star-structured co-clustering problems, in which a central data type is connected to all the other data types. Partitioning this graph leads to a co-clustering of all the data types under the constraints of the star structure. Although a graph partitioning approach has been adopted before to address star-structured heterogeneous clustering problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very fast, as it requires only the solution of a sparse overdetermined system of linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.
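    As a rough illustration of the computational core described above, the sketch below solves a sparse overdetermined least-squares system with SciPy's LSQR solver; the matrix here is synthetic, not the paper's actual star-structured formulation.

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
m, n = 5000, 200                                   # more equations than unknowns
A = sparse_random(m, n, density=0.01, format="csr", random_state=0)
b = rng.standard_normal(m)

# lsqr iteratively solves min_x ||A x - b||_2, exploiting the sparsity of A.
x, istop, itn, normr = lsqr(A, b)[:4]
print(f"stop code {istop}, {itn} iterations, residual norm {normr:.3f}")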

    Valuable Feature Improvement of Content Clustering and Categorization via Metadata

    In text mining applications, every document is accompanied by side-information. This side-information can take several forms, for instance document provenance information, links contained in the document, user access behaviour from web logs, or other non-text attributes embedded in the document. Such attributes carry a large amount of information that can be useful for clustering. However, the side-information can be hard to assess, because some of it is noise. In such cases it can be risky to incorporate the side-information into the mining process: it can either improve the quality of the representation used for mining, or it can add noise. A principled way of performing the mining is therefore required, so that the benefit of the side-information is maximised. Here we formulate a k-medoids method, which overcomes drawbacks of the k-means algorithm. We design an algorithm that combines a classical partitioning algorithm with probabilistic models to create an effective clustering method, and then show how to extend the approach to the classification problem. This general technique is used to enhance both clustering and classification algorithms, so that the use of side-information can greatly improve the quality of text clustering and classification while maintaining a high level of efficiency. The entire framework is then deployed on the cloud. DOI: 10.17762/ijritcc2321-8169.15078
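    To make the partitioning step concrete, here is a minimal k-medoids sketch in the Voronoi-iteration style. It is only an assumed baseline of the kind the paper builds on; it omits the side-information handling and the probabilistic models described above, and all names are illustrative.

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assign points to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # the new medoid minimizes the total distance within its cluster
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

X = np.random.default_rng(1).standard_normal((300, 8))
medoids, labels = k_medoids(X, k=3)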

    Evolutionary star-structured heterogeneous data co-clustering

    A star-structured interrelationship, which is a common pattern in real-world data, has a central object type connected to all the other types of objects. One of the key challenges in evolutionary clustering is the integration of historical data with current data. Traditionally, smoothness in data transitions over time is achieved by means of cost functions defined over the historical and current data. These functions provide a tunable tolerance for shifts in the current data by accounting, for each instance, for all of its historical information. Once the historical data is integrated into the current data using cost functions, a co-clustering is obtained using algorithms such as spectral clustering, non-negative matrix factorization, and information-theoretic clustering. Non-negative matrix factorization has been proven efficient and scalable for large data and is less memory-intensive than the other approaches. It tri-factorizes the original data matrix into a row indicator matrix, a column indicator matrix, and a matrix that captures the correlation between the row and column clusters. However, the challenges in clustering evolving heterogeneous data have not yet been addressed. In this thesis, I propose a new algorithm for clustering a specific case of this problem, viz. star-structured heterogeneous data. The proposed algorithm provides cost functions to integrate historical star-structured heterogeneous data into the current data, and then uses non-negative matrix factorization to cluster the instances and features at each time step. This contribution provides an avenue for the further development of higher-order evolutionary co-clustering algorithms.
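    The factor roles mentioned above can be made concrete with a plain non-negative matrix tri-factorization $X \approx F S G^T$ using standard multiplicative updates. This is a hedged sketch only: it omits the evolutionary cost functions and the star structure that the thesis adds, and the function name and parameters are illustrative.

import numpy as np

def nmtf(X, k_rows, k_cols, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k_rows))       # row (instance) cluster indicators
    S = rng.random((k_rows, k_cols))  # association between row and column clusters
    G = rng.random((n, k_cols))       # column (feature) cluster indicators
    for _ in range(n_iter):
        # multiplicative updates for min ||X - F S G^T||_F^2 with non-negativity
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

X = np.abs(np.random.default_rng(1).standard_normal((60, 40)))
F, S, G = nmtf(X, k_rows=4, k_cols=3)
print(np.linalg.norm(X - F @ S @ G.T))   # reconstruction error of the tri-factorization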

    Relevance Search via Bipolar Label Diffusion on Bipartite Graphs

    The task of relevance search is to find items relevant to some given queries, which can be viewed either as an information retrieval problem or as a semi-supervised learning problem. In order to combine the advantages of both views, we develop a new relevance search method using label diffusion on bipartite graphs, and we propose a heat diffusion-based algorithm, namely bipartite label diffusion (BLD). Our method yields encouraging experimental results on a number of relevance search problems.
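    For intuition, the following is a generic label-diffusion sketch on a bipartite graph: seed labels are placed on the query-side nodes and scores are propagated back and forth over a normalized biadjacency matrix. It is not the exact BLD update rule from the paper; all function and parameter names are assumptions for illustration.

import numpy as np

def bipartite_diffusion(B, query_rows, alpha=0.85, n_iter=50):
    """B: biadjacency matrix (rows x columns); query_rows: indices of query nodes on the row side."""
    dr, dc = B.sum(axis=1), B.sum(axis=0)
    W = B / (np.sqrt(np.outer(dr, dc)) + 1e-12)      # symmetric degree normalization
    y_r = np.zeros(B.shape[0])
    y_r[query_rows] = 1.0                            # seed labels on the query nodes
    f_r, f_c = y_r.copy(), np.zeros(B.shape[1])
    for _ in range(n_iter):
        f_c = W.T @ f_r                              # push scores to the column side
        f_r = alpha * (W @ f_c) + (1 - alpha) * y_r  # pull back, re-injecting the seeds
    return f_r, f_c                                  # per-node relevance scores

B = (np.random.default_rng(0).random((20, 30)) < 0.2).astype(float)
row_scores, col_scores = bipartite_diffusion(B, query_rows=[0, 3])
ranking = np.argsort(-col_scores)                    # most relevant column items first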

    How to Round Subspaces: A New Spectral Clustering Algorithm

    A basic problem in spectral clustering is the following. If a solution obtained from the spectral relaxation is close to an integral solution, is it possible to find this integral solution even though they might be in completely different bases? In this paper, we propose a new spectral clustering algorithm. It can recover a $k$-partition such that the subspace corresponding to the span of its indicator vectors is $O(\sqrt{opt})$ close to the original subspace in spectral norm, with $opt$ being the minimum possible ($opt \le 1$ always). Moreover, our algorithm does not impose any restriction on the cluster sizes. Previously, no algorithm was known which could find a $k$-partition closer than $o(k \cdot opt)$. We present two applications of our algorithm. The first finds a disjoint union of bounded-degree expanders which approximates a given graph in spectral norm. The second approximates the sparsest $k$-partition in a graph in which each cluster has expansion at most $\phi_k$, provided $\phi_k \le O(\lambda_{k+1})$, where $\lambda_{k+1}$ is the $(k+1)^{st}$ eigenvalue of the Laplacian matrix. This significantly improves upon previous algorithms, which required $\phi_k \le O(\lambda_{k+1}/k)$.
    Comment: Appeared in SODA 201
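    For context, the sketch below shows the standard spectral relaxation that such algorithms start from, followed by the usual k-means rounding of the eigenvector rows. The paper's contribution is a different, provably better rounding step, which is not reproduced here; the example graph and names are illustrative.

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def spectral_clusters(A, k):
    """A: symmetric adjacency matrix; returns a k-way partition of the vertices."""
    L = laplacian(A, normed=True)                    # normalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U = eigvecs[:, :k]                               # span approximates the indicator vectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Two loosely connected cliques: a clean 2-partition is expected.
A = np.zeros((20, 20))
A[:10, :10] = 1
A[10:, 10:] = 1
A[0, 10] = A[10, 0] = 1
np.fill_diagonal(A, 0)
print(spectral_clusters(A, k=2))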

    Tensor Spectral Clustering for Partitioning Higher-order Network Structures

    Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph, and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as to graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.
    Comment: SDM 201
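    As a small illustration of the evaluation criterion mentioned above, the helper below (not from the paper; names are illustrative) counts how many directed 3-cycles a given partition cuts in a directed adjacency matrix.

import numpy as np
from itertools import permutations

def cut_directed_3cycles(A, labels):
    """Count directed 3-cycles in adjacency matrix A and how many are cut by `labels`."""
    n = A.shape[0]
    cut = total = 0
    for i, j, k in permutations(range(n), 3):
        if i < j and i < k and A[i, j] and A[j, k] and A[k, i]:  # each cycle counted once
            total += 1
            if not (labels[i] == labels[j] == labels[k]):
                cut += 1
    return cut, total

A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = 1
print(cut_directed_3cycles(A, labels=[0, 0, 0, 1, 1, 1]))       # -> (0, 2): neither cycle is cut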