50 research outputs found

    Efficient Mining of Heterogeneous Star-Structured Data

    Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph theoretical framework for addressing star-structured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star-structure. Although a graph partitioning approach has been adopted before to address star-structured heterogeneous complex problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very quick as it requires a simple solution to a sparse system of overdetermined linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.

    Valuable Feature Improvement of Content Clustering and Categorization via Metadata

    In content mining applications, every record contains side-data. This side data may be of various kinds, for instance record provenance information, the links in the record, user-access behavior from web logs, or other non-textual attributes embedded in the content record. Such attributes can contain a large amount of information for clustering purposes. However, the relative importance of this side data can be difficult to estimate, because part of the information is noise. In such cases, it can be risky to incorporate side-information into the mining process, because it can either improve the quality of the representation for the mining process or add noise to it. We therefore need a principled way to perform the mining process so as to maximize the advantages of using this side information. In this work, we use a k-medoids formulation, which overcomes the problems of the k-means computation. We design an algorithm that combines a classical partitioning algorithm with probabilistic models to create an effective clustering approach, and then show how to extend the approach to the classification problem. This general approach is used to design both clustering and classification algorithms. The use of side-data can thus greatly enhance the quality of content clustering and classification while maintaining a high level of efficiency. Finally, we deploy the entire framework in the cloud. DOI: 10.17762/ijritcc2321-8169.15078

    Evolutionary star-structured heterogeneous data co-clustering

    A star-structured interrelationship, which is a more common type in real-world data, has a central object connected to the other types of objects. One of the key challenges in evolutionary clustering is the integration of historical data into current data. Traditionally, smoothness in data transition over a period of time is achieved by means of cost functions defined over historical and current data. These functions provide a tunable tolerance for shifts in the current data by taking into account all historical information for the corresponding instance. Once historical data is integrated into current data using cost functions, co-clustering is obtained using various co-clustering algorithms such as spectral clustering, non-negative matrix factorization, and information-theory-based clustering. Non-negative matrix factorization has been proven efficient and scalable for large data and is less memory intensive compared to other approaches. Non-negative matrix factorization tri-factorizes the original data matrix into a row indicator matrix, a column indicator matrix, and a matrix that provides the correlation between the row and column clusters. However, challenges in clustering evolving heterogeneous data have never been addressed. In this thesis, I propose a new algorithm for clustering a specific case of this problem, viz. the star-structured heterogeneous data. The proposed algorithm will provide cost functions to integrate historical star-structured heterogeneous data into current data. Then I will use non-negative matrix factorization to cluster each time-step of instances and features. This contribution to the field will provide an avenue for further development of higher-order evolutionary co-clustering algorithms.
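
    The tri-factorization described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes plain multiplicative updates for X ≈ F S Gᵀ, with no orthogonality constraints and no historical cost term, and the names `nmf_tri_factorize`, `relative_error`, `k_rows`, and `k_cols` are hypothetical.

```python
import numpy as np


def nmf_tri_factorize(X, k_rows, k_cols, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix X (n x m) as F @ S @ G.T.

    F (n x k_rows) plays the role of the row indicator matrix,
    G (m x k_cols) the column indicator matrix, and
    S (k_rows x k_cols) links row clusters to column clusters.
    Assumption: unconstrained multiplicative updates that minimize
    the squared Frobenius error; other variants add constraints.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k_rows)) + eps
    S = rng.random((k_rows, k_cols)) + eps
    G = rng.random((m, k_cols)) + eps
    for _ in range(n_iter):
        # Each update keeps its factor non-negative and does not
        # increase the error while the other two factors are fixed.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G


def relative_error(X, F, S, G):
    """Frobenius reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - F @ S @ G.T) / np.linalg.norm(X)


if __name__ == "__main__":
    # Toy matrix with two row blocks and two column blocks.
    X = np.kron(np.eye(2), np.ones((3, 4))) + 0.05
    F, S, G = nmf_tri_factorize(X, k_rows=2, k_cols=2)
    print("row clusters:", F.argmax(axis=1))
    print("column clusters:", G.argmax(axis=1))
    print("relative error:", relative_error(X, F, S, G))
```

    Cluster labels are read off as the largest entry in each row of F and G. In an evolutionary setting, each time-step would first be combined with its history before a factorization of this kind is run.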

    Relevance Search via Bipolar Label Diffusion on Bipartite Graphs

    The task of relevance search is to find relevant items for some given queries, which can be viewed either as an information retrieval problem or as a semi-supervised learning problem. In order to combine the advantages of both, we develop a new relevance search method using label diffusion on bipartite graphs, and we propose a heat diffusion-based algorithm, namely bipartite label diffusion (BLD). Our method yields encouraging experimental results on a number of relevance search problems.

    Evolutionary spectral co-clustering

    The field of mining evolving data is relatively new and evolutionary clustering is among the latest in this trend. Presently, there are algorithms for evolutionary k-means, agglomerative hierarchical, and spectral clustering. These have been excellent in showing the advantages of using evolving data snapshots for better clustering results. In these algorithms, the key step in the conversion from static data handling to evolving data handling has been the addition of the historical cost function. The cost function determines whether or not instances should be moved from one cluster to the next between time-steps, based on the historical cuts made between the instances in the dataset. These cost functions are the means by which evolutionary clustering provides smooth transitions, as there is a tunable tolerance for shifts in cluster membership. This also means that transitions between clusters become much more significant. For example, if an author-word matrix were clustered over ten years and an author changed clusters part way through the time-line, it is a likely indicator that the author has changed research topics. Methods for mining evolving data have not yet expanded into co-clustering; for this reason I have contributed a new algorithm for co-clustering evolving data. The algorithm uses spectral co-clustering to cluster each time-step of instances and features. Using the previous example, cluster changes in features (or words) for an author-word matrix are significant in that they may indicate a change in meaning for the word. This contribution to the field provides an avenue for further development of evolutionary co-clustering algorithms.
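
    The idea of a tunable tolerance for membership shifts can be sketched as follows. This is an illustration under stated assumptions, not the thesis's cost function: temporal smoothing is approximated by a convex blend of the current and previous matrices with a hypothetical weight `alpha`, and the co-clustering step is a two-way spectral bipartition based on the second singular vectors of the degree-normalized matrix. The function names are hypothetical, and the matrices are assumed to have no all-zero rows or columns.

```python
import numpy as np


def spectral_bipartition(A):
    """Split rows and columns of a non-negative matrix into two co-clusters.

    Uses the second left and right singular vectors of the
    degree-normalized matrix D1^{-1/2} A D2^{-1/2}; the sign of each
    rescaled entry gives the cluster label.
    """
    d1 = A.sum(axis=1)  # row degrees
    d2 = A.sum(axis=0)  # column degrees
    An = A / np.sqrt(np.outer(d1, d2))
    U, _, Vt = np.linalg.svd(An, full_matrices=False)
    row_score = U[:, 1] / np.sqrt(d1)
    col_score = Vt[1, :] / np.sqrt(d2)
    return (row_score > 0).astype(int), (col_score > 0).astype(int)


def evolutionary_cocluster(A_now, A_prev, alpha):
    """Co-cluster the current snapshot while smoothing against the previous one.

    alpha = 1.0 ignores history; smaller values make cluster changes
    harder to trigger. The blend is an assumed stand-in for a
    historical cost function.
    """
    blended = alpha * A_now + (1.0 - alpha) * A_prev
    return spectral_bipartition(blended)


if __name__ == "__main__":
    # Rows are authors, columns are words; two topics of four words each.
    base = np.kron(np.eye(2), np.ones((3, 4))) + 0.1
    A_prev = base.copy()
    A_now = base.copy()
    A_now[2] = base[3]  # author 2 now writes like the second group
    for alpha in (1.0, 0.2):
        rows, cols = evolutionary_cocluster(A_now, A_prev, alpha)
        print("alpha =", alpha, "author clusters:", rows)
```

    With `alpha=1.0` the author in row 2 is grouped with the second topic immediately, while a small `alpha` keeps that author with the original group until the change persists. The label values themselves are arbitrary, since the sign of a singular vector is not fixed.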

    Tumor Extraction and Volume Estimation for T1-Weighted Magnetic Resonance Brain Images

    Magnetic resonance imaging (MRI) is a significant imaging technology for brain tumor diagnosis because physicians can identify precise pathologies by studying the variations of tissue characteristics that occur in various kinds of MR images. Segmentation of MRI is a pre-process in determining the volume of different brain tissues, but here tumor detection is of primary concern. We propose a method to extract tumors as seen through MR brain images using co-clustering and morphological operations; tumor volume is then estimated using Cavalier's estimator of morphometric volume. Quantitative analysis showed that the proposed method yielded better results in comparison with the fuzzy c-means algorithm (FCM).

    Tensor Spectral Clustering for Partitioning Higher-order Network Structures

    Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms. Comment: SDM 201

    Data Clustering And Visualization Through Matrix Factorization

    Clustering is traditionally an unsupervised task whose goal is to find natural groupings or clusters in multidimensional data based on perceived similarities among the patterns. The purpose of clustering is to extract useful information from unlabeled data. In order to present the knowledge obtained by clustering in a meaningful way, data visualization has become a popular and growing area of research. Visualization can provide a qualitative overview of large and complex data sets, which helps us gain the insight needed to truly understand the phenomena of interest in the data. The contribution of this dissertation is two-fold: Semi-Supervised Non-negative Matrix Factorization (SS-NMF) for data clustering/co-clustering and Exemplar-based data Visualization (EV) through matrix factorization. Compared to traditional data mining models, matrix-based methods are fast, easy to understand and implement, and especially suitable for solving large-scale challenging problems in text mining, image grouping, medical diagnosis, and bioinformatics. In this dissertation, we present two effective matrix-based solutions in the new directions of data clustering and visualization. First, in many practical learning domains, there is a large supply of unlabeled data but limited labeled data, and in most cases it might be expensive to generate large amounts of labeled data. Traditional clustering algorithms completely ignore these valuable labeled data and thus are inapplicable to these problems. Consequently, semi-supervised clustering, which can incorporate domain knowledge to guide a clustering algorithm, has become a topic of significant recent interest. Thus, we develop a Non-negative Matrix Factorization (NMF) based framework to incorporate prior knowledge into data clustering.
    Moreover, with the fast growth of Internet and computational technologies in the past decade, many data mining applications have advanced swiftly from the simple clustering of one data type to the co-clustering of multiple data types, usually involving high heterogeneity. To this end, we extend SS-NMF to perform heterogeneous data co-clustering. From a theoretical perspective, SS-NMF for data clustering/co-clustering is mathematically rigorous. The convergence and correctness of our algorithms are proved. In addition, we discuss the relationship between SS-NMF and other well-known clustering and co-clustering models. Second, most current clustering models only provide the centroids (e.g., mathematical means of the clusters) without inferring the representative exemplars from real data, and thus they are unable to summarize or visualize the raw data well. A new method, Exemplar-based Visualization (EV), is proposed to cluster and visualize extremely large-scale data. Capitalizing on recent advances in matrix approximation and factorization, EV provides a means to visualize large-scale data with high accuracy (in retaining neighbor relations), high efficiency (in computation), and high flexibility (through the use of exemplars). Empirically, we demonstrate the superior performance of our matrix-based data clustering and visualization models through extensive experiments performed on publicly available large-scale data sets.

    Semi-supervised heterogeneous fusion for multimedia data co-clustering
