6 research outputs found

    Applying subclustering and Lp distance in Weighted K-Means with distributed centroids

    We consider the Weighted K-Means algorithm with distributed centroids, aimed at clustering data sets with numerical, categorical, and mixed types of data. Our approach allows given features (i.e., variables) to have different weights at different clusters, supporting the intuitive idea that features may have different degrees of relevance at different clusters. We use the Minkowski metric in a way that feature weights become feature re-scaling factors for any considered exponent. Moreover, the traditional Silhouette clustering validity index was adapted to deal with both numerical and categorical types of features. Finally, we show that our new method usually outperforms traditional K-Means as well as the recently proposed WK-DC clustering algorithm. Peer reviewed.
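
    As a rough sketch of the central idea, the weighted Minkowski distance below treats per-cluster feature weights as re-scaling factors for any exponent p. The function names and the specific form (weights raised to p) are assumptions drawn from the abstract, not the paper's published formulation.

        import numpy as np

        def minkowski_weighted_distance(x, centroid, weights, p):
            """Weighted Minkowski (Lp) distance between a point and a centroid.

            `weights` holds one weight per feature for this particular cluster,
            so feature relevance can differ from cluster to cluster. Raising
            the weights to the exponent p makes each weight act as a feature
            re-scaling factor for any p (an assumed form, per the abstract).
            """
            return float(np.sum((weights ** p) * np.abs(x - centroid) ** p))

        def assign(x, centroids, cluster_weights, p):
            """Assign a point to the cluster with the smallest weighted distance."""
            dists = [minkowski_weighted_distance(x, c, w, p)
                     for c, w in zip(centroids, cluster_weights)]
            return int(np.argmin(dists))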

    An Improved Crow Search Algorithm for Data Clustering

    Metaheuristic algorithms are often trapped in local optima when searching for solutions. This problem frequently occurs in optimization cases involving high dimensionality, such as data clustering. An imbalance between the exploration and exploitation processes causes this condition, because search agents are unable to reach the best solution in the search space. In this study, the problem is overcome by modifying the solution-update mechanism so that a search agent not only follows another randomly chosen search agent but also has the opportunity to follow the best search agent. In addition, the balance of exploration and exploitation is enhanced by a mechanism that updates the awareness probability of each search agent in accordance with its ability to search for solutions. The improved mechanism allows the proposed algorithm to obtain good solutions in less computational time than the Genetic Algorithm and Particle Swarm Optimization. On large datasets, the proposed algorithm is shown to provide the best solution among the compared algorithms.
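
    The modified update step might look roughly like the sketch below. The constants, function names, and the exact blending and awareness-probability rules are illustrative assumptions, since the abstract does not give the formulas.

        import numpy as np

        rng = np.random.default_rng(42)

        def crow_update(x_i, memories, fitness, fl=2.0, p_best=0.3):
            """One (sketched) position update in the modified Crow Search step.

            With probability p_best the crow follows the best agent's memory
            instead of a randomly chosen one; otherwise it behaves like the
            standard CSA follow step. fl is the flight length; a minimization
            problem is assumed.
            """
            best = int(np.argmin(fitness))
            j = rng.integers(len(memories))
            target = memories[best] if rng.random() < p_best else memories[j]
            return x_i + rng.random() * fl * (target - x_i)

        def awareness_probability(fitness, i, ap_min=0.05, ap_max=0.5):
            """Per-agent awareness probability scaled by fitness rank: the best
            agent keeps ap_min (easy to exploit), the worst gets ap_max
            (its followers are diverted, encouraging exploration). This is an
            assumed scheme, not the paper's published rule."""
            rank = np.argsort(np.argsort(fitness))[i] / max(len(fitness) - 1, 1)
            return ap_min + (ap_max - ap_min) * rank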

    FAST K-MEANS COLOR IMAGE CLUSTERING WITH NORMALIZED DISTANCE VALUES

    Image segmentation is an intermediate image-processing stage in which the pixels of the image are grouped into clusters such that the data resulting from this stage is more meaningful for the next stage. Many clustering methods are widely used to segment images, and most of them use the features of the image pixels. While some clustering methods consider the local features of images by taking into account the neighborhood system of the pixels, others consider the global features of images. The K-means clustering algorithm, which is easy to understand and simple to put into practice, considers the global features of the entire image. In this algorithm, the number of clusters is initially given by the user as an input value. For segmentation, the algorithm runs faster if the distribution of the pixels over a histogram is used; the values in the histogram must be discrete within a certain range. In this paper, we use the Euclidean distance between the color values of the pixels and the mean color value of the entire image, so as to take advantage of every color value of the pixels. To obtain a histogram that consists of discrete values, we normalize the distance values to a specific range and round them to the nearest integer for discretization. We tested the versions of K-means with the gray-level value histogram and with the distance-value histogram on an urban image set taken from the ISPRS WG III/4 2D Semantic Labeling dataset. Comparing the two histograms, the distance-value histogram proposed in this paper performs better than the gray-level value histogram.
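
    A minimal sketch of building the proposed normalized distance-value histogram follows. The function name, bin count, and normalization range are assumptions, since the paper's exact choices are not stated in the abstract.

        import numpy as np

        def distance_value_histogram(image, n_bins=256):
            """Histogram of pixel-to-mean-color Euclidean distances.

            Each pixel's distance to the mean color of the whole image is
            normalized to [0, n_bins - 1] and rounded to the nearest integer,
            producing the discrete values that histogram-based K-means needs.
            """
            pixels = image.reshape(-1, image.shape[-1]).astype(np.float64)
            mean_color = pixels.mean(axis=0)
            dist = np.linalg.norm(pixels - mean_color, axis=1)
            span = dist.max() - dist.min()
            scaled = (dist - dist.min()) / (span if span > 0 else 1.0) * (n_bins - 1)
            values = np.rint(scaled).astype(int)
            return np.bincount(values, minlength=n_bins), values

    One-dimensional K-means can then run over the histogram bins rather than over every pixel, which is where the speed advantage comes from.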

    On k-means iterations and Gaussian clusters

    Nowadays, k-means remains arguably the most popular clustering algorithm (Jain, 2010; Vouros et al., 2021). Two of its main properties are simplicity and speed in practice. Here, our main claim is that the average number of iterations k-means takes to converge (τ̄) is in fact very informative. We find this particularly interesting because τ̄ is always known when applying k-means but has never, to our knowledge, been used in the data analysis process. By experimenting with Gaussian clusters, we show that τ̄ is related to the structure of the data set under study: data sets containing Gaussian clusters have a much lower τ̄ than those containing uniformly random data. In fact, we go considerably further and demonstrate a pattern of inverse correlation between τ̄ and clustering quality. We illustrate the importance of our findings through two practical applications. First, we describe cases in which τ̄ can be effectively used to identify irrelevant features present in a given data set, or to improve the results of existing feature selection algorithms. Second, we show that there is a strong relationship between τ̄ and the number of clusters in a data set, and that this relationship can be used to find the true number of clusters it contains.
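
    τ̄ is directly observable with standard tooling: scikit-learn's KMeans, for instance, exposes the iteration count of each fit as n_iter_. The sketch below (run counts and data are illustrative) averages it over repeated single-initialization runs and reproduces the Gaussian-versus-uniform contrast described above.

        import numpy as np
        from sklearn.cluster import KMeans

        def mean_iterations(X, k, n_runs=50, seed=0):
            """Average number of iterations k-means needs to converge (tau-bar)
            over repeated single-initialization runs."""
            iters = [
                KMeans(n_clusters=k, n_init=1, random_state=seed + r).fit(X).n_iter_
                for r in range(n_runs)
            ]
            return float(np.mean(iters))

        # Gaussian clusters should converge in fewer iterations than uniform noise.
        rng = np.random.default_rng(0)
        gaussians = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
                               for c in ([0, 0], [4, 4], [0, 4])])
        uniform = rng.uniform(-1, 5, size=(600, 2))
        print(mean_iterations(gaussians, k=3), mean_iterations(uniform, k=3))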

    Online content clustering using variant K-Means Algorithms

    Thesis (MTech)--Cape Peninsula University of Technology, 2019.
    We live at a time when an enormous amount of information is created, and unfortunately much of it is redundant. There is a huge amount of online information in the form of news articles that discuss similar stories, and the number of articles is projected to grow. This growth makes it difficult for a person to process all that information in order to stay up to date on a subject. There is a need for a solution that can organize this similar information into specific themes. The solution is a branch of Artificial Intelligence (AI) called Machine Learning (ML), using clustering algorithms: grouping similar pieces of information into containers. Once the information is clustered, people can be presented with information on their subject of interest grouped together, and the information in a group can be further processed into a summary. This research focuses on unsupervised learning. The literature shows that K-Means is one of the most widely used unsupervised clustering algorithms: it is easy to learn, easy to implement, and efficient. However, there is a host of variants of K-Means. The research seeks to find a variant of K-Means that can cluster duplicate or similar news articles into correct semantic groups with acceptable performance. The research is an experiment. News articles were collected from the internet using gocrawler, a program that takes Uniform Resource Locators (URLs) as arguments and collects a story from the website each URL points to. The URLs are read from a repository. The stories come riddled with adverts and images from the web page; this is referred to as dirty text. The dirty text is sanitized, that is, cleaned by removing the adverts and images. The clean text is stored in a repository and is one input for the algorithm. The other input is the K value, which all K-Means-based variants take to define the number of clusters to be produced. The stories are manually classified and labelled, each with the class to which it belongs, so that the accuracy of the machine clustering can be checked. The data collection process itself was not unsupervised, but the algorithms used to cluster are fully unsupervised. A total of 45 stories were collected and 9 manual clusters were identified; under each manual cluster there are sub-clusters of stories about one specific event. The performance of all the variants is compared to find the one with the best clustering results. Performance was checked by comparing the manual classification against the clustering results from each algorithm. Each K-Means variant is run with the same settings on the same data set of 45 stories. The settings used are:
    • dimensionality of the feature vectors,
    • window size, i.e., the maximum distance between the current and predicted word in a sentence,
    • minimum word frequency,
    • a specified range of words to ignore,
    • the number of threads used to train the model,
    • the training algorithm, either distributed memory (PV-DM) or distributed bag of words (PV-DBOW),
    • the initial learning rate, which decreases to a minimum alpha as training progresses,
    • the number of iterations per cycle,
    • the final learning rate,
    • the number of clusters to form,
    • the number of times the algorithm will be run,
    • the method used for initialization.
    The results obtained show that K-Means can perform better than K-Modes; they are tabulated and presented in graphs in chapter six. Clustering can be improved by incorporating Named Entity Recognition (NER) into the K-Means algorithms. Results can also be improved by implementing a multi-stage clustering technique, in which initial clustering is done first and each resulting cluster group is clustered again to achieve finer results.
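
    As a rough illustration of the pipeline the abstract describes, the sketch below embeds sanitized articles with gensim's Doc2Vec and clusters the vectors with scikit-learn's K-Means. The function name and all parameter values are illustrative placeholders, not the thesis settings; gensim 4.x is assumed.

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        from sklearn.cluster import KMeans

        def cluster_articles(texts, k):
            """Embed cleaned news articles with Doc2Vec, then cluster with K-Means.

            dm=0 selects PV-DBOW (dm=1 would select PV-DM); alpha decays to
            min_alpha over training, mirroring the settings listed in the
            abstract. All concrete values here are illustrative placeholders.
            """
            docs = [TaggedDocument(words=text.lower().split(), tags=[i])
                    for i, text in enumerate(texts)]
            model = Doc2Vec(docs, vector_size=100, window=5, min_count=2, dm=0,
                            alpha=0.025, min_alpha=0.001, epochs=40, workers=4)
            vectors = [model.dv[i] for i in range(len(texts))]
            return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

    With the 45 sanitized stories as texts and K = 9, the returned labels could then be compared against the manual classes to measure clustering accuracy.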