    Efficient Mining of Heterogeneous Star-Structured Data

    Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph-theoretical framework for addressing star-structured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star structure. Although the graph partitioning approach has been adopted before to address star-structured heterogeneous complex problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very quick, as it requires only a simple solution to a sparse system of overdetermined linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency, and stability of the proposed algorithm.
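
The abstract does not say which linear system the algorithm solves, so the following is only a generic sketch of what "a solution to an overdetermined system" means in practice: a least-squares fit computed with NumPy. The matrix `A`, the vector `b`, and the function name `solve_overdetermined` are made up for illustration and are not taken from the paper. At realistic scale a sparse solver would be used instead of the dense call shown here.

```python
import numpy as np

def solve_overdetermined(A, b):
    """Return the least-squares solution x that minimizes ||A x - b||_2."""
    x, _residuals, _rank, _singular_values = np.linalg.lstsq(A, b, rcond=None)
    return x

# Toy system (illustrative only): 4 equations, 2 unknowns, mostly zeros.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [1.0, 0.0]])
b = np.array([1.0, 2.0, 3.0, 1.0])

x = solve_overdetermined(A, b)
print(x)
```

Because there are more equations than unknowns, an exact solution generally does not exist; the least-squares solution is the one whose residual is orthogonal to the columns of `A`.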

    Evolutionary star-structured heterogeneous data co-clustering

    A star-structured interrelationship, which is the more common type in real-world data, has a central object type connected to the other object types. One of the key challenges in evolutionary clustering is integrating historical data into current data. Traditionally, smoothness in data transition over a period of time is achieved by means of cost functions defined over historical and current data. These functions provide a tunable tolerance for shifts in the current data, weighing each current instance against all historical information for the corresponding instance. Once historical data is integrated into current data using cost functions, co-clustering is obtained using co-clustering algorithms such as spectral clustering, non-negative matrix factorization, and information-theoretic clustering. Non-negative matrix factorization has been shown to be efficient and scalable for large data and is less memory intensive than other approaches. Non-negative matrix tri-factorization decomposes the original data matrix into a row indicator matrix, a column indicator matrix, and a matrix that captures the correlation between the row and column clusters. However, the challenges in clustering evolving heterogeneous data have not been addressed. In this thesis, I propose a new algorithm for clustering a specific case of this problem, viz. star-structured heterogeneous data. The proposed algorithm will provide cost functions to integrate historical star-structured heterogeneous data into current data. I will then use non-negative matrix factorization to cluster the instances and features at each time step. This contribution to the field will provide an avenue for further development of higher-order evolutionary co-clustering algorithms.
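
The tri-factorization mentioned above can be sketched as follows. This is a minimal illustration, assuming the standard multiplicative updates for minimizing the Frobenius error of X ≈ F S Gᵀ with non-negative factors; it is not the thesis's algorithm and it omits the historical cost functions entirely. The function name `nmf_tri_factorize`, its parameters, and the toy matrix are hypothetical.

```python
import numpy as np

def nmf_tri_factorize(X, k_rows, k_cols, n_iter=200, seed=0, eps=1e-9):
    """Approximate non-negative X (n x m) as F @ S @ G.T.

    F (n x k_rows) plays the role of a soft row indicator matrix,
    G (m x k_cols) a soft column indicator matrix, and
    S (k_rows x k_cols) links row clusters to column clusters.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((n, k_rows)) + eps
    S = rng.random((k_rows, k_cols)) + eps
    G = rng.random((m, k_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

# Toy block-structured matrix (illustrative data only).
X = np.array([[5.0, 5.0, 0.0, 0.0],
              [5.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 4.0, 4.0]])

F, S, G = nmf_tri_factorize(X, k_rows=2, k_cols=2)
row_clusters = F.argmax(axis=1)   # hard row assignment from the soft indicators
col_clusters = G.argmax(axis=1)   # hard column assignment
print(row_clusters, col_clusters)
```

Taking the arg-max of each row of `F` and `G` is one common way to read hard cluster labels off the soft indicator matrices; the result depends on the random initialization.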

    Semi-supervised heterogeneous evolutionary co-clustering

    One of the challenges in machine learning is the lack of a sufficient number of labeled training instances. At the same time, generating labeled data is expensive and time-consuming. The semi-supervised approach has shown promising results on datasets with few labeled instances. The key challenge is incorporating semi-supervised knowledge into heterogeneous data that evolves over time. Most prior work that uses semi-supervised knowledge has been performed on static heterogeneous data. Semi-supervised knowledge, supplied in either constraint-based or distance-based form, is incorporated into the data to help the clustering algorithm obtain better clusters. I propose a framework that incorporates prior knowledge to perform co-clustering on evolving heterogeneous data. This framework can be used to solve a wide range of problems in text analysis, web analysis, and image grouping. In the semi-supervised approach, domain knowledge is incorporated as constraints that guide the clustering process toward a more effective clustering of the data. In the proposed framework, I use a constraint-based semi-supervised non-negative matrix factorization approach to obtain the co-clustering of the evolving heterogeneous data. The constraint-based approach applies user-provided must-link or cannot-link constraints to the central data type before performing co-clustering. To process the original datasets efficiently in terms of time and space, I use the Dynamic Colibri low-rank approximation technique to obtain a sparse representation of the input data matrix.
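
To make the must-link and cannot-link constraints concrete, here is a small sketch of how such pairwise constraints can be represented and checked against a candidate clustering. It is not the constraint-based NMF framework described in the abstract; the function names, the example pairs, and the labelings are hypothetical.

```python
def must_link_groups(n, must_link):
    """Merge instances joined by must-link pairs (union-find).

    Returns a list where instances with the same value must share a cluster.
    """
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)
    return [find(i) for i in range(n)]

def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a cluster labeling breaks."""
    broken_ml = sum(1 for a, b in must_link if labels[a] != labels[b])
    broken_cl = sum(1 for a, b in cannot_link if labels[a] == labels[b])
    return broken_ml + broken_cl

# Illustrative constraints on five instances of the central data type.
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3)]

groups = must_link_groups(5, must_link)
good = count_violations([0, 0, 0, 1, 1], must_link, cannot_link)
bad = count_violations([0, 0, 1, 0, 1], must_link, cannot_link)
print(groups, good, bad)
```

Must-link constraints are transitive, which is why the sketch first collapses them into groups; a constraint-aware clustering method can then treat each group as a unit and penalize or forbid labelings that break a cannot-link pair.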

    Graph Summarization

    The continuous and rapid growth of highly interconnected datasets, which are both voluminous and complex, calls for the development of adequate processing and analytical techniques. One method for condensing and simplifying such datasets is graph summarization. It denotes a series of application-specific algorithms designed to transform graphs into more compact representations while preserving structural patterns, query answers, or specific property distributions. As this problem is common to several areas studying graph topologies, different approaches, such as clustering, compression, sampling, or influence detection, have been proposed, primarily based on statistical and optimization methods. The focus of our chapter is to pinpoint the main graph summarization methods, with particular emphasis on the most recent approaches and novel research trends on this topic not yet covered by previous surveys. Comment: To appear in the Encyclopedia of Big Data Technologies

    Data mining and fusion
