1,747 research outputs found
Evolutionary star-structured heterogeneous data co-clustering
A star-structured interrelationship, which is a more common type in real world data, has a central object connected to the other types of objects. One of the key challenges in evolutionary clustering is integration of historical data in current data. Traditionally, smoothness in data transition over a period of time is achieved by means of cost functions defined over historical and current data. These functions provide a tunable tolerance for shifts of current data accounting instance to all historical information for corresponding instance. Once historical data is integrated into current data using cost functions, co-clustering is obtained using various co-clustering algorithms like spectral clustering, non-negative matrix factorization, and information theory based clustering. Non-negative matrix factorization has been proven efficient and scalable for large data and is less memory intensive compared to other approaches. Non-negative matrix factorization tri-factorizes original data matrix into row indicator matrix, column indicator matrix, and a matrix that provides correlation between the row and column clusters. However, challenges in clustering evolving heterogeneous data have never been addressed. In this thesis, I propose a new algorithm for clustering a specific case of this problem, viz. the star-structured heterogeneous data. The proposed algorithm will provide cost functions to integrate historical star-structured heterogeneous data into current data. Then I will use non-negative matrix factorization to cluster each time-step of instances and features. This contribution to the field will provide an avenue for further development of higher order evolutionary co-clustering algorithms
Recovery Guarantees for Quadratic Tensors with Limited Observations
We consider the tensor completion problem of predicting the missing entries
of a tensor. The commonly used CP model has a triple product form, but an
alternate family of quadratic models which are the sum of pairwise products
instead of a triple product have emerged from applications such as
recommendation systems. Non-convex methods are the method of choice for
learning quadratic models, and this work examines their sample complexity and
error guarantee. Our main result is that with the number of samples being only
linear in the dimension, all local minima of the mean squared error objective
are global minima and recover the original tensor accurately. The techniques
lead to simple proofs showing that convex relaxation can recover quadratic
tensors provided with linear number of samples. We substantiate our theoretical
results with experiments on synthetic and real-world data, showing that
quadratic models have better performance than CP models in scenarios where
there are limited amount of observations available
Clustering Boolean Tensors
Tensor factorizations are computationally hard problems, and in particular,
are often significantly harder than their matrix counterparts. In case of
Boolean tensor factorizations -- where the input tensor and all the factors are
required to be binary and we use Boolean algebra -- much of that hardness comes
from the possibility of overlapping components. Yet, in many applications we
are perfectly happy to partition at least one of the modes. In this paper we
investigate what consequences does this partitioning have on the computational
complexity of the Boolean tensor factorizations and present a new algorithm for
the resulting clustering problem. This algorithm can alternatively be seen as a
particularly regularized clustering algorithm that can handle extremely
high-dimensional observations. We analyse our algorithms with the goal of
maximizing the similarity and argue that this is more meaningful than
minimizing the dissimilarity. As a by-product we obtain a PTAS and an efficient
0.828-approximation algorithm for rank-1 binary factorizations. Our algorithm
for Boolean tensor clustering achieves high scalability, high similarity, and
good generalization to unseen data with both synthetic and real-world data
sets
- …