
    Efficient Mining of Heterogeneous Star-Structured Data

    Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves the simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph-theoretical framework for addressing star-structured co-clustering problems, in which a central data type is connected to all the other data types. Partitioning this graph leads to a co-clustering of all the data types under the constraints of the star structure. Although a graph partitioning approach has been adopted before to address star-structured heterogeneous clustering problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very fast, as it requires only the solution of a sparse overdetermined system of linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.
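    As a rough illustration of the computational core described above, the sketch below solves a sparse overdetermined least-squares system with SciPy's LSQR solver; the matrix here is synthetic, not the paper's actual star-structured formulation.

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
m, n = 5000, 200                                   # more equations than unknowns
A = sparse_random(m, n, density=0.01, format="csr", random_state=0)
b = rng.standard_normal(m)

# lsqr iteratively solves min_x ||A x - b||_2, exploiting the sparsity of A.
x, istop, itn, normr = lsqr(A, b)[:4]
print(f"stop code {istop}, {itn} iterations, residual norm {normr:.3f}")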

    Valuable Feature Improvement of Content Clustering and Categorization via Metadata

    In text mining applications, every document is accompanied by side-information. This side-information can take several forms, for instance document provenance information, links contained in the document, user access behaviour from web logs, or other non-text attributes embedded in the document. Such attributes carry a large amount of information that can be useful for clustering. However, the side-information can be hard to assess, because some of it is noise. In such cases it can be risky to incorporate the side-information into the mining process: it can either improve the quality of the representation used for mining, or it can add noise. A principled way of performing the mining is therefore required, so that the benefit of the side-information is maximised. Here we formulate a k-medoids method, which overcomes drawbacks of the k-means algorithm. We design an algorithm that combines a classical partitioning algorithm with probabilistic models to create an effective clustering method, and then show how to extend the approach to the classification problem. This general technique is used to enhance both clustering and classification algorithms, so that the use of side-information can greatly improve the quality of text clustering and classification while maintaining a high level of efficiency. The entire framework is then deployed on the cloud. DOI: 10.17762/ijritcc2321-8169.15078
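    To make the partitioning step concrete, here is a minimal k-medoids sketch in the Voronoi-iteration style. It is only an assumed baseline of the kind the paper builds on; it omits the side-information handling and the probabilistic models described above, and all names are illustrative.

import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)               # assign points to nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # the new medoid minimizes the total distance within its cluster
                new_medoids[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)

X = np.random.default_rng(1).standard_normal((300, 8))
medoids, labels = k_medoids(X, k=3)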

    Evolutionary star-structured heterogeneous data co-clustering

    A star-structured interrelationship, which is a common pattern in real-world data, has a central object type connected to all the other types of objects. One of the key challenges in evolutionary clustering is the integration of historical data with current data. Traditionally, smoothness in data transitions over time is achieved by means of cost functions defined over the historical and current data. These functions provide a tunable tolerance for shifts in the current data by accounting, for each instance, for all of its historical information. Once the historical data is integrated into the current data using cost functions, a co-clustering is obtained using algorithms such as spectral clustering, non-negative matrix factorization, and information-theoretic clustering. Non-negative matrix factorization has been proven efficient and scalable for large data and is less memory-intensive than the other approaches. It tri-factorizes the original data matrix into a row indicator matrix, a column indicator matrix, and a matrix that captures the correlation between the row and column clusters. However, the challenges in clustering evolving heterogeneous data have not yet been addressed. In this thesis, I propose a new algorithm for clustering a specific case of this problem, viz. star-structured heterogeneous data. The proposed algorithm provides cost functions to integrate historical star-structured heterogeneous data into the current data, and then uses non-negative matrix factorization to cluster the instances and features at each time step. This contribution provides an avenue for the further development of higher-order evolutionary co-clustering algorithms.
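    The factor roles mentioned above can be made concrete with a plain non-negative matrix tri-factorization $X \approx F S G^T$ using standard multiplicative updates. This is a hedged sketch only: it omits the evolutionary cost functions and the star structure that the thesis adds, and the function name and parameters are illustrative.

import numpy as np

def nmtf(X, k_rows, k_cols, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F = rng.random((m, k_rows))       # row (instance) cluster indicators
    S = rng.random((k_rows, k_cols))  # association between row and column clusters
    G = rng.random((n, k_cols))       # column (feature) cluster indicators
    for _ in range(n_iter):
        # multiplicative updates for min ||X - F S G^T||_F^2 with non-negativity
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
    return F, S, G

X = np.abs(np.random.default_rng(1).standard_normal((60, 40)))
F, S, G = nmtf(X, k_rows=4, k_cols=3)
print(np.linalg.norm(X - F @ S @ G.T))   # reconstruction error of the tri-factorization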

    Relevance Search via Bipolar Label Diffusion on Bipartite Graphs

    The task of relevance search is to find items relevant to some given queries, which can be viewed either as an information retrieval problem or as a semi-supervised learning problem. In order to combine the advantages of both views, we develop a new relevance search method using label diffusion on bipartite graphs, and we propose a heat diffusion-based algorithm, namely bipartite label diffusion (BLD). Our method yields encouraging experimental results on a number of relevance search problems.
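    For intuition, the following is a generic label-diffusion sketch on a bipartite graph: seed labels are placed on the query-side nodes and scores are propagated back and forth over a normalized biadjacency matrix. It is not the exact BLD update rule from the paper; all function and parameter names are assumptions for illustration.

import numpy as np

def bipartite_diffusion(B, query_rows, alpha=0.85, n_iter=50):
    """B: biadjacency matrix (rows x columns); query_rows: indices of query nodes on the row side."""
    dr, dc = B.sum(axis=1), B.sum(axis=0)
    W = B / (np.sqrt(np.outer(dr, dc)) + 1e-12)      # symmetric degree normalization
    y_r = np.zeros(B.shape[0])
    y_r[query_rows] = 1.0                            # seed labels on the query nodes
    f_r, f_c = y_r.copy(), np.zeros(B.shape[1])
    for _ in range(n_iter):
        f_c = W.T @ f_r                              # push scores to the column side
        f_r = alpha * (W @ f_c) + (1 - alpha) * y_r  # pull back, re-injecting the seeds
    return f_r, f_c                                  # per-node relevance scores

B = (np.random.default_rng(0).random((20, 30)) < 0.2).astype(float)
row_scores, col_scores = bipartite_diffusion(B, query_rows=[0, 3])
ranking = np.argsort(-col_scores)                    # most relevant column items first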

    How to Round Subspaces: A New Spectral Clustering Algorithm

    A basic problem in spectral clustering is the following. If a solution obtained from the spectral relaxation is close to an integral solution, is it possible to find this integral solution even though they might be in completely different bases? In this paper, we propose a new spectral clustering algorithm. It can recover a $k$-partition such that the subspace corresponding to the span of its indicator vectors is $O(\sqrt{opt})$ close to the original subspace in spectral norm, with $opt$ being the minimum possible ($opt \le 1$ always). Moreover, our algorithm does not impose any restriction on the cluster sizes. Previously, no algorithm was known which could find a $k$-partition closer than $o(k \cdot opt)$. We present two applications of our algorithm. The first finds a disjoint union of bounded-degree expanders which approximates a given graph in spectral norm. The second approximates the sparsest $k$-partition in a graph in which each cluster has expansion at most $\phi_k$, provided $\phi_k \le O(\lambda_{k+1})$, where $\lambda_{k+1}$ is the $(k+1)^{st}$ eigenvalue of the Laplacian matrix. This significantly improves upon previous algorithms, which required $\phi_k \le O(\lambda_{k+1}/k)$.
    Comment: Appeared in SODA 201
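    For context, the sketch below shows the standard spectral relaxation that such algorithms start from, followed by the usual k-means rounding of the eigenvector rows. The paper's contribution is a different, provably better rounding step, which is not reproduced here; the example graph and names are illustrative.

import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def spectral_clusters(A, k):
    """A: symmetric adjacency matrix; returns a k-way partition of the vertices."""
    L = laplacian(A, normed=True)                    # normalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)                   # eigenvalues in ascending order
    U = eigvecs[:, :k]                               # span approximates the indicator vectors
    U /= np.linalg.norm(U, axis=1, keepdims=True) + 1e-12
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)

# Two loosely connected cliques: a clean 2-partition is expected.
A = np.zeros((20, 20))
A[:10, :10] = 1
A[10:, 10:] = 1
A[0, 10] = A[10, 0] = 1
np.fill_diagonal(A, 0)
print(spectral_clusters(A, k=2))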

    Tensor Spectral Clustering for Partitioning Higher-order Network Structures

    Spectral graph theory-based methods represent an important class of tools for studying the structure of networks. Spectral methods are based on a first-order Markov chain derived from a random walk on the graph, and thus they cannot take advantage of important higher-order network substructures such as triangles, cycles, and feed-forward loops. Here we propose a Tensor Spectral Clustering (TSC) algorithm that allows for modeling higher-order network structures in a graph partitioning framework. Our TSC algorithm allows the user to specify which higher-order network structures (cycles, feed-forward loops, etc.) should be preserved by the network clustering. Higher-order network structures of interest are represented using a tensor, which we then partition by developing a multilinear spectral method. Our framework can be applied to discovering layered flows in networks as well as to graph anomaly detection, which we illustrate on synthetic networks. In directed networks, a higher-order structure of particular interest is the directed 3-cycle, which captures feedback loops in networks. We demonstrate that our TSC algorithm produces large partitions that cut fewer directed 3-cycles than standard spectral clustering algorithms.
    Comment: SDM 201
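    As a small illustration of the evaluation criterion mentioned above, the helper below (not from the paper; names are illustrative) counts how many directed 3-cycles a given partition cuts in a directed adjacency matrix.

import numpy as np
from itertools import permutations

def cut_directed_3cycles(A, labels):
    """Count directed 3-cycles in adjacency matrix A and how many are cut by `labels`."""
    n = A.shape[0]
    cut = total = 0
    for i, j, k in permutations(range(n), 3):
        if i < j and i < k and A[i, j] and A[j, k] and A[k, i]:  # each cycle counted once
            total += 1
            if not (labels[i] == labels[j] == labels[k]):
                cut += 1
    return cut, total

A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]:
    A[i, j] = 1
print(cut_directed_3cycles(A, labels=[0, 0, 0, 1, 1, 1]))       # -> (0, 2): neither cycle is cut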