Thinning out redundant empirical data
Given a set of "empirical" points whose coordinates are perturbed by
errors, we analyze whether it contains redundant information, that is,
whether some of its elements could be represented by a single equivalent
point. If this is the case, the empirical information associated with the
set could be described by fewer points, chosen in a suitable way. We
present two different methods to reduce the cardinality of the set, which
compute a new set of points equivalent to the original one, that is,
representing the same empirical information. Though our algorithms use
some basic notions of cluster analysis, they are specifically designed
for "thinning out" redundant data. We include some experimental results
which illustrate the practical effectiveness of our methods.
Comment: 14 pages; 3 figures
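The paper's actual algorithms are not reproduced in the abstract; as a rough illustration of the general idea, the following is a minimal greedy sketch that merges points lying within a hypothetical error tolerance `tol` into a single representative (their centroid). The function name and the sample points are illustrative assumptions, not the paper's method.

```python
from math import dist

def thin_points(points, tol):
    """Greedily merge points closer than `tol` to an existing
    representative into that representative's group, replacing the
    group by its centroid. Reduces cardinality while keeping one
    'equivalent point' per group of near-redundant observations."""
    representatives = []
    groups = []
    for p in points:
        for i, r in enumerate(representatives):
            if dist(p, r) <= tol:
                groups[i].append(p)
                # update the representative to the group centroid
                n = len(groups[i])
                representatives[i] = tuple(
                    sum(c[k] for c in groups[i]) / n for k in range(len(p))
                )
                break
        else:
            representatives.append(p)
            groups.append([p])
    return representatives

pts = [(0.0, 0.0), (0.05, 0.02), (1.0, 1.0), (0.98, 1.03)]
print(thin_points(pts, tol=0.2))  # two representatives instead of four
```

Here the two pairs of nearby points collapse to their centroids, so four "empirical" points are represented by two equivalent ones.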
Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm
Over the past five decades, k-means has become the clustering algorithm of
choice in many application domains primarily due to its simplicity, time/space
efficiency, and invariance to the ordering of the data points. Unfortunately,
the algorithm's sensitivity to the initial selection of the cluster centers
remains its most serious drawback. Numerous initialization methods have
been proposed to address this drawback. Many of these methods, however, have
time complexity superlinear in the number of data points, which makes them
impractical for large data sets. On the other hand, linear methods are often
random and/or sensitive to the ordering of the data points. These methods are
generally unreliable in that the quality of their results is unpredictable.
Therefore, it is common practice to perform multiple runs of such methods and
take the output of the run that produces the best results. Such a practice,
however, greatly increases the computational requirements of the otherwise
highly efficient k-means algorithm. In this chapter, we investigate the
empirical performance of six linear, deterministic (non-random), and
order-invariant k-means initialization methods on a large and diverse
collection of data sets from the UCI Machine Learning Repository. The results
demonstrate that two relatively unknown hierarchical initialization methods due
to Su and Dy outperform the remaining four methods with respect to two
objective effectiveness criteria. In addition, a recent method due to Erisoglu
et al. performs surprisingly poorly.
Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms
(Springer, 2014). arXiv admin note: substantial text overlap with
arXiv:1304.7465, arXiv:1209.196
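To give a flavor of what "linear, deterministic, order-invariant" means in practice, here is a minimal sketch of a maximin-style seeding that starts from the point farthest from the data centroid. This is an illustrative assumption, not necessarily one of the six methods evaluated in the chapter; note it is only order-invariant up to ties in the distances.

```python
from math import dist

def maximin_init(points, k):
    """Deterministic maximin-style seeding: take the point farthest
    from the data centroid as the first center, then repeatedly add
    the point whose distance to its nearest chosen center is largest.
    No randomness; order-invariant up to ties."""
    d = len(points[0])
    mean = tuple(sum(p[i] for p in points) / len(points) for i in range(d))
    centers = [max(points, key=lambda p: dist(p, mean))]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers

data = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
print(maximin_init(data, 3))
```

Because every step is an arg-max over fixed distances, repeated runs on the same data give the same centers, avoiding the multiple-restart practice the chapter criticizes.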
Serendipity Identification Using Distance-Based Approach
The recommendation system is a method for helping consumers find products that fit their preferences. However, recommendations based merely on user preference are no longer satisfactory. Consumers expect recommendations that are novel, unexpected, and relevant. This requires the development of a serendipity recommendation system that matches the character of serendipity data. However, there is still debate among researchers about a common definition of serendipity. Therefore, our study proposes to identify the character of serendipity data by directly using the serendipity ground truth from the well-known MovieLens dataset. The serendipity data identification is based on a distance-based approach using collaborative filtering and the k-means clustering algorithm. Collaborative filtering is used to calculate the similarity between data points, while k-means is used to cluster the collaborative filtering data. The resulting clusters are used to determine the position of the serendipity cluster. The results of this study show that the average distance between the recommended-movie cluster and the serendipity-movie cluster is 0.85 units, which is neither the closest nor the farthest cluster from the recommended-movie cluster.
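The centroid-distance comparison at the core of this approach can be sketched as follows. The feature vectors and cluster memberships below are hypothetical stand-ins for the collaborative-filtering similarities and k-means clusters described in the abstract.

```python
from math import dist

def centroid(cluster):
    """Mean point of a cluster of equal-length tuples."""
    d = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(d))

# Hypothetical feature vectors for three clusters of movies
recommended = [(0.9, 0.8), (0.85, 0.9)]
serendipity = [(0.3, 0.4), (0.25, 0.35)]
other = [(0.0, 0.05), (0.05, 0.0)]

rec_c = centroid(recommended)
gap = dist(rec_c, centroid(serendipity))
far = dist(rec_c, centroid(other))
# Serendipitous items sit at an intermediate distance: not in the
# recommended cluster, but not in the farthest cluster either.
print(gap < far)
```

The study's 0.85-unit figure is exactly this kind of inter-centroid distance, measured on the real MovieLens serendipity ground truth rather than toy vectors.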
Clustering Algorithms: Their Application to Gene Expression Data
Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a tremendous opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, and is essential in the data mining process for revealing natural structures and identifying interesting patterns in the underlying data. The clustering of gene expression data has been proven useful in revealing the natural structure inherent in such data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. A further benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.
Optimizing K-Means by Fixing Initial Cluster Centers
Abstract Data mining techniques help in business decisio
Developing Models to Visualize and Analyze User Interaction for Financial Technology Websites
Vestigo Ventures manually processes website traffic data to analyze the business performance of financial technology companies. By analyzing how people navigate through company websites, Vestigo aims to understand different customer activity patterns. Our team designed and implemented a tool that automatically processes clickstream data to visualize different customer activity within a website and compute statistics about user activity. This tool will provide Vestigo with insight into the effectiveness of their clients' website structures and help them make recommendations to their clients.
Comparison of different similarity measures in hierarchical clustering
The management of datasets containing heterogeneous types of data is a crucial point in the context of precision medicine, where genetic, environmental, and lifestyle information of each individual has to be analyzed simultaneously. Clustering represents a powerful method, used in data mining, for extracting new useful knowledge from unlabeled datasets. Clustering methods are essentially distance-based, since they measure the similarity (or the distance) between two elements or between one element and the cluster centroid. However, the selection of the distance metric is not a trivial task: it can influence the clustering results and, thus, the extracted information. In this study we analyze the impact of four similarity measures (Manhattan or L1 distance, Euclidean or L2 distance, Chebyshev or L∞ distance, and Gower distance) on the clustering results obtained for datasets containing different types of variables. We applied hierarchical clustering combined with an automatic cut-point selection method to six datasets publicly available on the UCI Repository. Four different clusterings were obtained for every dataset (one for each distance) and were analyzed in terms of number of clusters, number of elements in each cluster, and cluster centroids. Our results showed that changing the distance metric produces substantial modifications in the obtained clusters. This behavior is particularly evident for datasets containing heterogeneous variables. Thus, the choice of the distance measure should not be made a priori but evaluated according to the set of data to be analyzed and the task to be accomplished.
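The effect the abstract describes, that the metric changes which elements count as similar and hence which clusters merge first, can be seen on a toy example. The points below are hypothetical; only the three Minkowski-family metrics from the study are shown (Gower distance additionally handles mixed variable types).

```python
def manhattan(a, b):  # L1: sum of coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):  # L2: straight-line distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):  # L-infinity: largest single-coordinate difference
    return max(abs(x - y) for x, y in zip(a, b))

p = (0, 0)
q = (3, 3)  # moderate change on both coordinates
r = (5, 0)  # large change on one coordinate

for name, metric in [("L1", manhattan), ("L2", euclidean), ("Linf", chebyshev)]:
    nearer = "q" if metric(p, q) < metric(p, r) else "r"
    print(name, "says the nearer point is", nearer)
```

Under L1, `r` is nearer to `p` (5 vs 6), while under L2 and L∞, `q` is nearer; an agglomerative algorithm would therefore make a different first merge depending on the metric, which is the study's central observation.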
DIMK-means: "Distance-based Initialization Method for K-means Clustering Algorithm"
Partition-based clustering is one of several clustering techniques that attempt to directly decompose the dataset into a set of disjoint clusters. The k-means algorithm, which depends on the partition-based clustering technique, is popular, widely used, and applied to a variety of domains. K-means clustering results are extremely sensitive to the initial centroids; this is one of the major drawbacks of the k-means algorithm. Due to this sensitivity, several different initialization approaches have been proposed for the k-means algorithm over the past decades. This paper proposes a selection method for the initial cluster centroids in k-means clustering instead of the random selection method. The research provides a detailed performance assessment of the proposed initialization method over many datasets with different dimensions, numbers of observations, groups, and clustering complexities. The ability to identify the true clusters is the performance evaluation standard in this research. The experimental results show that the proposed initialization method is more effective and converges to more accurate clustering results than those of the random initialization method.
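The abstract contrasts a fixed, distance-based choice of initial centroids with the usual random draw. The following minimal Lloyd-iteration sketch shows where that choice plugs in: `centers` is simply passed in rather than sampled. This is a generic k-means loop under stated assumptions, not the paper's DIMK-means procedure itself.

```python
from math import dist

def kmeans(points, centers, iters=10):
    """Plain Lloyd iterations started from the supplied centers.
    A deterministic, distance-based choice of `centers` (the kind of
    method the paper proposes) replaces the usual random sampling."""
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            clusters[i].append(p)
        # update step: move each center to its cluster's mean
        centers = [
            tuple(sum(q[d] for q in c) / len(c) for d in range(len(old)))
            if c else old  # keep an empty cluster's center unchanged
            for c, old in zip(clusters, centers)
        ]
    return centers

data = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(kmeans(data, centers=[(0, 0), (10, 10)]))
```

With the initial centers fixed, the run is fully reproducible, which is what makes the comparison against random initialization in the paper well defined.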