311 research outputs found

    Thinning out redundant empirical data

    Given a set X of "empirical" points, whose coordinates are perturbed by errors, we analyze whether it contains redundant information, that is, whether some of its elements could be represented by a single equivalent point. If this is the case, the empirical information associated with X could be described by fewer points, chosen in a suitable way. We present two different methods to reduce the cardinality of X which compute a new set of points equivalent to the original one, that is, representing the same empirical information. Though our algorithms use some basic notions of cluster analysis, they are specifically designed for "thinning out" redundant data. We include some experimental results which illustrate the practical effectiveness of our methods. Comment: 14 pages; 3 figures
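    A minimal sketch of the idea (not the paper's actual algorithms, whose details the abstract does not give): greedily merge points that lie within a tolerance eps of a seed point and replace each group by a single representative, its centroid. The function name and the greedy strategy are illustrative assumptions.

```python
import math

def thin_points(points, eps):
    """Greedy illustration of "thinning out": points closer than eps
    to a seed point are merged and replaced by their centroid."""
    remaining = list(points)
    representatives = []
    while remaining:
        seed = remaining.pop(0)
        group = [seed]
        # collect all still-unassigned points within eps of the seed
        for p in remaining[:]:
            if math.dist(seed, p) < eps:
                group.append(p)
                remaining.remove(p)
        # one representative per group: the coordinate-wise mean
        representatives.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return representatives
```

    Two near-duplicate points collapse into one representative, while distant points survive unchanged.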

    Linear, Deterministic, and Order-Invariant Initialization Methods for the K-Means Clustering Algorithm

    Over the past five decades, k-means has become the clustering algorithm of choice in many application domains, primarily due to its simplicity, time/space efficiency, and invariance to the ordering of the data points. Unfortunately, the algorithm's sensitivity to the initial selection of the cluster centers remains its most serious drawback. Numerous initialization methods have been proposed to address this drawback. Many of these methods, however, have time complexity superlinear in the number of data points, which makes them impractical for large data sets. On the other hand, linear methods are often random and/or sensitive to the ordering of the data points. These methods are generally unreliable in that the quality of their results is unpredictable. Therefore, it is common practice to perform multiple runs of such methods and take the output of the run that produces the best results. Such a practice, however, greatly increases the computational requirements of the otherwise highly efficient k-means algorithm. In this chapter, we investigate the empirical performance of six linear, deterministic (non-random), and order-invariant k-means initialization methods on a large and diverse collection of data sets from the UCI Machine Learning Repository. The results demonstrate that two relatively unknown hierarchical initialization methods due to Su and Dy outperform the remaining four methods with respect to two objective effectiveness criteria. In addition, a recent method due to Erisoglu et al. performs surprisingly poorly. Comment: 21 pages, 2 figures, 5 tables, Partitional Clustering Algorithms (Springer, 2014). arXiv admin note: substantial text overlap with arXiv:1304.7465, arXiv:1209.196
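    As an illustration of the deterministic, order-invariant family (a generic maximin sketch, not one of the six methods evaluated in the chapter): start from the data mean, then repeatedly add the point farthest from its nearest chosen center. The result does not depend on the order of the data points, except when distances tie.

```python
import math

def maximin_init(points, k):
    """Deterministic, order-invariant (up to distance ties) k-means
    initialization: the first center is the overall mean; each
    subsequent center is the point farthest from all current centers."""
    dim = len(points[0])
    n = len(points)
    # first center: the coordinate-wise mean of the whole data set
    centers = [tuple(sum(p[d] for p in points) / n for d in range(dim))]
    while len(centers) < k:
        # next center: the point maximizing its distance to the nearest center
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers
```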

    Serendipity Identification Using Distance-Based Approach

    The recommendation system is a method for helping consumers find products that fit their preferences. However, recommendations based merely on user preference are no longer satisfactory. Consumers expect recommendations that are novel, unexpected, and relevant. This calls for a serendipity recommendation system that matches the character of serendipity data. However, there is still debate among researchers about a common definition of serendipity. Therefore, our study identifies the character of serendipity data directly from the serendipity ground truth in the well-known MovieLens dataset. The serendipity data identification is based on a distance-based approach using collaborative filtering and the k-means clustering algorithm. Collaborative filtering is used to calculate the similarity value between data, while k-means is used to cluster the collaborative filtering data. The resulting clusters are used to determine the position of the serendipity cluster. The result of this study shows that the average distance between the recommended movie cluster and the serendipity movie cluster is 0.85 units, which is neither the closest nor the farthest cluster from the recommended movie cluster.
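    The centroid-to-centroid distance used to position one cluster relative to another can be sketched as follows (toy two-dimensional data for illustration, not the MovieLens-derived features used in the study):

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a cluster's points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def cluster_distance(cluster_a, cluster_b):
    """Distance between two clusters measured centroid to centroid,
    e.g. between a recommended-movie cluster and a serendipity cluster."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))
```

    Ranking all clusters by this distance from the recommended-movie cluster is what lets one say a cluster is "neither the closest nor the farthest".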

    Clustering Algorithms: Their Application to Gene Expression Data

    Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a prodigious opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, and is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. The clustering of gene expression data has been proven useful in revealing the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. Another benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.

    Optimizing K-Means by Fixing Initial Cluster Centers

    Abstract: Data mining techniques help in business decision making…

    Developing Models to Visualize and Analyze User Interaction for Financial Technology Websites

    Vestigo Ventures manually processes website traffic data to analyze the business performance of financial technology companies. By analyzing how people navigate through company websites, Vestigo aims to understand different customer activity patterns. Our team designed and implemented a tool that automatically processes clickstream data to visualize different customer activity within a website and compute statistics about user activity. This tool will provide Vestigo with insight into the effectiveness of their clients’ website structures and help them make recommendations to their clients.

    Comparison of different similarity measures in hierarchical clustering

    The management of datasets containing heterogeneous types of data is a crucial point in the context of precision medicine, where genetic, environmental, and lifestyle information on each individual has to be analyzed simultaneously. Clustering is a powerful method, used in data mining, for extracting new useful knowledge from unlabeled datasets. Clustering methods are essentially distance-based, since they measure the similarity (or the distance) between two elements or between one element and the cluster centroid. However, the selection of the distance metric is not a trivial task: it can influence the clustering results and, thus, the extracted information. In this study we analyze the impact of four similarity measures (Manhattan or L1 distance, Euclidean or L2 distance, Chebyshev or L∞ distance, and Gower distance) on the clustering results obtained for datasets containing different types of variables. We applied hierarchical clustering combined with an automatic cut-point selection method to six datasets publicly available in the UCI Repository. Four different clusterings were obtained for every dataset (one for each distance) and were analyzed in terms of the number of clusters, the number of elements in each cluster, and the cluster centroids. Our results showed that changing the distance metric produces substantial modifications in the obtained clusters. This behavior is particularly evident for datasets containing heterogeneous variables. Thus, the choice of the distance measure should not be made a priori but evaluated according to the set of data to be analyzed and the task to be accomplished.
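    The effect is easy to reproduce: under different metrics the same query point can have different nearest neighbours, so distance-based merges in hierarchical clustering change too. A minimal sketch of three of the four metrics on a toy example (Gower, which handles mixed variable types, is omitted for brevity):

```python
def manhattan(a, b):  # L1 distance
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):  # L2 distance
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):  # L-infinity distance
    return max(abs(x - y) for x, y in zip(a, b))

p, a, b = (0, 0), (3, 3), (5, 0)
# Under L1, b is nearer to p (5 < 6); under L-infinity, a is (3 < 5).
nearest_l1 = min((a, b), key=lambda q: manhattan(p, q))    # -> (5, 0)
nearest_linf = min((a, b), key=lambda q: chebyshev(p, q))  # -> (3, 3)
```

    Since agglomerative clustering repeatedly merges the closest pair, such disagreements propagate into different dendrograms and, after cutting, different clusters.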

    DIMK-means: Distance-based Initialization Method for K-means Clustering Algorithm

    Partition-based clustering techniques attempt to directly decompose the dataset into a set of disjoint clusters. K-means, a partition-based algorithm, is popular, widely used, and applied in a variety of domains. K-means clustering results are extremely sensitive to the initial centroids; this is one of the major drawbacks of the k-means algorithm. Due to this sensitivity, several initialization approaches have been proposed for the K-means algorithm over the last decades. This paper proposes a selection method for the initial cluster centroids in K-means clustering instead of the random selection method. The research provides a detailed performance assessment of the proposed initialization method over many datasets with different dimensions, numbers of observations, groups, and clustering complexities. The ability to identify the true clusters is the performance evaluation standard in this research. The experimental results show that the proposed initialization method is more effective and converges to more accurate clustering results than the random initialization method.
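    Whatever selection rule produces the initial centroids, once they are fixed (rather than drawn at random) plain Lloyd iterations make the whole run reproducible. A generic sketch of those iterations, given any fixed initial centers (this is standard k-means, not the paper's proposed selection method):

```python
import math

def kmeans(points, centers, iters=10):
    """Lloyd's iterations from a fixed set of initial centers, so
    repeated runs on the same data give identical results."""
    for _ in range(iters):
        # assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else ctr
            for cl, ctr in zip(clusters, centers)
        ]
    return centers
```

    With random initialization, by contrast, each run may land in a different local optimum, which is exactly the unreliability the proposed method targets.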