As the data being generated and collected grows larger and more complex, scalable tools for effective data mining become more important. Because data mining is a broad area of research, we focus on co-clustering, a relatively new but promising subarea of data mining. Co-clustering performs two-way clustering: it clusters row and column entities simultaneously. Specifically, we consider the recently proposed Bregman co-clustering algorithm, which has shown significant promise in clustering quality and has therefore gained popularity. However, a recent study has demonstrated that a main-memory-based implementation of the algorithm creates difficulties for applications with large data sets. The focus of this paper is therefore a scalable implementation of the Bregman co-clustering algorithm. We discuss how the summary statistics required by the algorithm can be stored in a data cube and computed by an OLAP engine. We conduct experiments using several real data sets from various domains, including matrix decomposition, bioinformatics, document clustering, and collaborative filtering (CF) based recommendation. Experimental results demonstrate the potential of our OLAP-based implementation. Moreover, our implementation has a large prospective user base because it works on top of OLAP, a widely deployed technology. It also facilitates the use of the Bregman co-clustering algorithm in applications with large data sets, at a time when co-clustering is finding applications in a variety of problems. This research is a step toward meeting the growing need to connect databases and data mining.
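To illustrate the idea of delegating summary statistics to a database engine, the following is a minimal sketch and not the implementation evaluated in this paper. It assumes the data matrix is stored as (row, column, value) triples and that the current row and column cluster assignments are kept in two mapping tables; the table names, the function `cocluster_summaries`, and the choice of per-co-cluster sum, count, and mean as the statistics are illustrative assumptions. SQLite's `GROUP BY` stands in here for the aggregation an OLAP engine would perform over a data cube.

```python
import sqlite3


def cocluster_summaries(entries, row_cluster, col_cluster):
    """Aggregate matrix entries per (row cluster, column cluster) pair.

    entries:     iterable of (row_id, col_id, value) triples (hypothetical schema)
    row_cluster: dict mapping row_id -> row cluster label
    col_cluster: dict mapping col_id -> column cluster label
    Returns {(row_cluster, col_cluster): (sum, count, mean)}.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE entry (row_id INTEGER, col_id INTEGER, value REAL)")
    cur.execute("CREATE TABLE row_map (row_id INTEGER PRIMARY KEY, rc INTEGER)")
    cur.execute("CREATE TABLE col_map (col_id INTEGER PRIMARY KEY, cc INTEGER)")
    cur.executemany("INSERT INTO entry VALUES (?, ?, ?)", list(entries))
    cur.executemany("INSERT INTO row_map VALUES (?, ?)", list(row_cluster.items()))
    cur.executemany("INSERT INTO col_map VALUES (?, ?)", list(col_cluster.items()))

    # The engine, not the client, computes the per-co-cluster aggregates.
    cur.execute(
        """
        SELECT r.rc, c.cc, SUM(e.value), COUNT(*), AVG(e.value)
        FROM entry AS e
        JOIN row_map AS r ON e.row_id = r.row_id
        JOIN col_map AS c ON e.col_id = c.col_id
        GROUP BY r.rc, c.cc
        """
    )
    result = {(rc, cc): (total, n, mean) for rc, cc, total, n, mean in cur.fetchall()}
    conn.close()
    return result


if __name__ == "__main__":
    # Toy 3x3 sparse matrix with two row clusters and two column clusters.
    toy_entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0), (2, 2, 10.0)]
    toy_rows = {0: 0, 1: 0, 2: 1}
    toy_cols = {0: 0, 1: 0, 2: 1}
    for key, stats in sorted(cocluster_summaries(toy_entries, toy_rows, toy_cols).items()):
        print(key, stats)
```

In this sketch only the small cluster-assignment tables would need to be rewritten between iterations, while the large entry table stays in the database, which is the property that makes an engine-side aggregation attractive for large data sets.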