
    Online k-means Clustering

    We study the problem of online clustering, where a clustering algorithm has to assign each newly arriving point to one of k clusters. The specific formulation we use is the k-means objective: at each time step the algorithm has to maintain a set of k candidate centers, and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the k-means objective, C, in hindsight. We show that, provided the data lie in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) over a discretized grid achieves a regret bound of Õ(√T) in expectation. We also present an online-to-offline reduction showing that an efficient no-regret online algorithm (even one allowed to choose a different set of candidate centers at each round) implies an efficient offline algorithm for the k-means problem, which is NP-hard. In light of this hardness, we consider the slightly weaker requirement of regret with respect to (1 + Δ)·C and present a no-regret algorithm with runtime O(T · poly(log T, k, d, 1/Δ)^(k(d+O(1)))). Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as Follow The Leader, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.
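The protocol in this abstract can be made concrete with a small sketch. The following is a minimal 1-D illustration of the online k-means loss, not the paper's MWUA construction: the learner must commit to k candidate centers before each point arrives, then pays the squared distance from the point to its closest center. The `ftl_like` learner plugged in here is a hypothetical stand-in for illustration.

```python
def online_kmeans_loss(points, choose_centers, k):
    """Online k-means protocol: before each point arrives, the learner
    commits to k candidate centers; the loss for that round is the squared
    distance from the point to the closest center."""
    total = 0.0
    history = []
    for p in points:
        centers = choose_centers(history, k)   # chosen before seeing p
        total += min((p - c) ** 2 for c in centers)
        history.append(p)
    return total

def ftl_like(history, k):
    """A naive learner (illustrative only): reuse the first k distinct
    points seen so far as centers."""
    seen = []
    for p in history:
        if p not in seen:
            seen.append(p)
        if len(seen) == k:
            break
    return seen if seen else [0.0]

loss = online_kmeans_loss([0.0, 1.0, 0.0, 1.0, 0.5], ftl_like, k=2)
```

For this stream the best 2 fixed centers in hindsight (at 0 and 1) incur total cost 0.25, paying only for the final point 0.5, so the learner's regret here is the difference between its cumulative loss and 0.25.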

    An Algorithm for Online K-Means Clustering

    This paper shows that one can be competitive with the k-means objective while operating online. In this model, the algorithm receives vectors v_1, ..., v_n one by one in an arbitrary order and must output a cluster identifier for each vector before receiving the next one. The online algorithm generates Õ(k) clusters whose k-means cost is Õ(W*), where W* is the optimal k-means cost using k clusters and Õ suppresses poly-logarithmic factors. Experiments show that it is not much worse than k-means++ despite operating in a strictly more constrained computational model.
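The online model described here can be illustrated with a much simpler threshold rule (this is not the paper's algorithm, just the flavor of the model, and the fixed `threshold` is an assumption): each identifier is emitted before the next point arrives, and a new cluster is opened only when no existing center is close enough.

```python
def online_cluster(points, threshold):
    """Emit a cluster identifier for each 1-D point as it arrives: join the
    nearest existing cluster if its center is within `threshold`, otherwise
    open a new cluster centered at the point."""
    centers, labels = [], []
    for p in points:
        best = None
        for i, c in enumerate(centers):
            if abs(p - c) <= threshold and (
                    best is None or abs(p - c) < abs(p - centers[best])):
                best = i
        if best is None:
            centers.append(p)
            best = len(centers) - 1
        labels.append(best)
    return labels, centers

labels, centers = online_cluster([0.0, 0.2, 5.0, 5.1, 10.0], threshold=1.0)
# labels == [0, 0, 1, 1, 2]
```

A rule like this can open far more than k clusters when the threshold is small, which echoes why the algorithm in the abstract is allowed Õ(k) clusters rather than exactly k.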

    Online content clustering using variant K-Means Algorithms

    Thesis (MTech), Cape Peninsula University of Technology, 2019. We live at a time when a great deal of information is created, and much of it is redundant. There is a huge amount of online news coverage in which many articles discuss similar stories, and the number of articles is projected to grow. This growth makes it difficult for a person to process all of that information in order to stay current on a subject. A solution is needed that can organize this similar information into specific themes. One such solution comes from machine learning (ML), a branch of artificial intelligence (AI), using clustering algorithms: grouping similar pieces of information into containers. Once the information is clustered, people can be presented with grouped information on their subject of interest, and the information in a group can be further processed into a summary.

    This research focuses on unsupervised learning. The literature indicates that K-Means is one of the most widely used unsupervised clustering algorithms: it is easy to learn, easy to implement, and efficient. However, there are many variants of K-Means. The research seeks a variant of K-Means that can cluster duplicate or similar news articles into correct semantic groups with acceptable performance.

    The research is an experiment. News articles were collected from the internet using gocrawler, a program that takes Uniform Resource Locators (URLs) as arguments and collects a story from the website each URL points to; the URLs are read from a repository. The collected stories arrive riddled with adverts and images from the web page; this is referred to as dirty text. The dirty text is sanitized, that is, cleaned by removing the adverts and images. The clean text is stored in a repository and serves as one input to the algorithm. The other input is the K value, which all K-Means-based variants take to define the number of clusters to be produced.

    The stories were manually classified and labelled, each with the class to which it belongs, so that the accuracy of machine clustering could be checked. The data collection process itself was not unsupervised, but the algorithms used to cluster are entirely unsupervised. A total of 45 stories were collected and 9 manual clusters were identified; under each manual cluster there are sub-clusters of stories about one specific event. The performance of all the variants was compared to find the one with the best clustering results, by comparing the manual classification against the clustering output of each algorithm. Each K-Means variant was run on the same data set of 45 stories with the same settings:

    ‱ dimensionality of the feature vectors;
    ‱ window size;
    ‱ maximum distance between the current and predicted word in a sentence;
    ‱ minimum word frequency;
    ‱ specified range of words to ignore;
    ‱ number of threads used to train the model;
    ‱ the training algorithm, either distributed memory (PV-DM) or distributed bag of words (PV-DBOW);
    ‱ the initial learning rate, which decreases to a minimum alpha as training progresses;
    ‱ number of iterations per cycle;
    ‱ final learning rate;
    ‱ number of clusters to form;
    ‱ the number of times the algorithm will be run;
    ‱ the method used for initialization.

    The results obtained show that K-Means can perform better than K-Modes; they are tabulated and presented in graphs in chapter six. Clustering can be improved by incorporating Named Entity Recognition (NER) into the K-Means algorithms. Results can also be improved by implementing a multi-stage clustering technique, where initial clustering is done first and each resulting cluster group is then clustered again to achieve finer clustering results.

    Applying Text Mining to Sentiment Analysis of Twitter Users of Online Transportation Services Using Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-Means

    Online transportation is currently popular and in demand in Indonesia, with Grab and Gojek being the most widely used services. Although online transportation services have received a positive response, a problem remains: many consumers are disappointed and dissatisfied with the service provided. This study aims to determine how public responses to these two online transportation services group together. Public responses about online transportation services were obtained from Twitter, one of the social media platforms most widely used in Indonesia. Twitter data consists of collections of text, so text mining is needed to analyze it. One form of analysis in text mining is text clustering, so this study uses text clustering to group opinions into several categories. The methods used for text clustering are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and K-Means. DBSCAN forms clusters from data points that lie close together, while points that lie far apart do not become cluster members and are referred to as noise. K-Means is a simple and fast clustering technique that can group fairly large amounts of data. The results show that DBSCAN and K-Means were not well suited to grouping the tweets addressed to the Gojek and Grab online transportation services in this study, because the silhouette coefficient was less than 0.5, indicating weak structure: the public-response tweets were not yet placed in the right groups.
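The 0.5 cutoff mentioned in this abstract refers to the silhouette coefficient. A minimal pure-Python version for 1-D data (the points and labellings below are made-up illustrations, not the study's tweets) shows how a good and a bad clustering land on either side of that threshold:

```python
def silhouette(points, labels):
    """Mean silhouette coefficient in 1-D.  For each point, a is the mean
    distance to the other points in its own cluster and b is the smallest
    mean distance to the points of any other cluster; the point's score is
    (b - a) / max(a, b).  Values above 0.5 suggest reasonable structure."""
    n = len(points)
    scores = []
    for i in range(n):
        same = [abs(points[i] - points[j]) for j in range(n)
                if j != i and labels[j] == labels[i]]
        if not same:                 # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(abs(points[i] - points[j]) for j in range(n)
                if labels[j] == l) / labels.count(l)
            for l in set(labels) if l != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n

pts = [0.0, 0.1, 5.0, 5.1]
good = silhouette(pts, [0, 0, 1, 1])   # clusters match the two groups
bad = silhouette(pts, [0, 1, 0, 1])    # each cluster mixes both groups
```

Here `good` is close to 1 while `bad` is negative, matching the rule of thumb the study applies: a mean silhouette below 0.5 signals that points are not yet in the right groups.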

    IMPLEMENTATION OF K-MEANS CLUSTERING ANALYSIS TO DETERMINE BARRIERS TO ONLINE LEARNING CASE STUDY: SWASTA YAPENDAK TINJOWAN JUNIOR HIGH SCHOOL

    Grouping students by their barriers to online learning during the COVID-19 pandemic yields clusters of students with the same characteristics in each cluster. The purpose of this study is to assist schools in determining students' online learning barriers during the pandemic, so that students with high levels of online learning barriers receive additional face-to-face hours, creating a more effective learning process. The method used in this study is a data mining technique, the k-means clustering algorithm. This algorithm was chosen because it is effective and efficient at processing large amounts of data, has sufficiently high accuracy across object sizes, and is not affected by the order of objects. The data were tested using Microsoft Excel as a manual check alongside an implementation in the PHP programming language with a MySQL database. The results of this study were 2 clusters: C1 (low cluster), with 4 students who were hampered during online learning, and C2 (high cluster), with 16 students who were not hampered during online learning. The study concludes that the k-means clustering algorithm can facilitate the grouping of online learning barriers for students at Swasta Yapendak Tinjowan Junior High School.
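The two-cluster split reported above can be reproduced in miniature with plain Lloyd-style k-means. The 1-D "barrier scores" below are hypothetical stand-ins for the study's student data, not its actual measurements:

```python
def kmeans_1d(data, centers, iters=20):
    """Plain Lloyd-style k-means on 1-D scores: repeatedly assign each
    value to its nearest center, then move each center to the mean of its
    cluster; finally report the converged assignment."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in data:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(len(centers)), key=lambda i: abs(x - centers[i]))
              for x in data]
    return labels, centers

# Hypothetical barrier scores: 4 high-barrier and 6 low-barrier students.
data = [9, 8, 10, 9, 2, 1, 3, 2, 2, 1]
labels, centers = kmeans_1d(data, [min(data), max(data)])
```

With k = 2 the data separate into a high-score cluster and a low-score cluster, mirroring the study's two-cluster (C1/C2) structure.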

    A fast and recursive algorithm for clustering large datasets with k-medians

    Clustering large samples of high-dimensional data with fast algorithms is an important challenge in computational statistics. Borrowing ideas from MacQueen (1967), who introduced a sequential version of the k-means algorithm, a new class of recursive stochastic gradient algorithms designed for the k-medians loss criterion is proposed. By their recursive nature, these algorithms are very fast and are well adapted to large samples of data that are allowed to arrive sequentially. It is proved that the stochastic gradient algorithm converges almost surely to the set of stationary points of the underlying loss criterion. Particular attention is paid to the averaged versions, which are known to have better performance, and a data-driven procedure for automatic selection of the descent step is proposed. The performance of the averaged sequential estimator is compared in a simulation study, in terms of both computation speed and estimation accuracy, with more classical partitioning techniques such as k-means, trimmed k-means, and PAM (partitioning around medoids). Finally, this new online clustering technique is illustrated by determining television audience profiles from a sample of more than 5,000 individual television audiences measured every minute over a period of 24 hours. Comment: under revision for Computational Statistics and Data Analysis.
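The recursive update at the heart of such algorithms is simple: each arriving point moves its nearest center a small, decreasing step toward itself, descending the stochastic gradient of the k-medians (absolute-distance) loss. Here is a 1-D sketch under an assumed 1/(t+1) step schedule; the paper's data-driven step selection and averaging are not reproduced:

```python
def sequential_kmedians(stream, centers, step=lambda t: 1.0 / (t + 1)):
    """Recursive stochastic-gradient k-medians in 1-D: each arriving point
    nudges its nearest center toward itself by a decreasing step.  The
    gradient of |x - c| w.r.t. c is sign(c - x), so descending it moves c
    toward x by the step size."""
    counts = [0] * len(centers)           # per-center update counters
    for x in stream:
        i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
        counts[i] += 1
        gamma = step(counts[i])
        if x > centers[i]:
            centers[i] += gamma
        elif x < centers[i]:
            centers[i] -= gamma
    return centers

# Points alternate around 0 and 10; the centers drift to those medians.
centers = sequential_kmedians([0.0, 10.0] * 100, [2.0, 8.0])
```

Because each point triggers only one center update, the cost per arrival is O(k), which is what makes the recursive scheme fast on large sequential samples.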

    Improved Algorithms for Time Decay Streams

    In the time-decay model for data streams, elements of an underlying data set arrive sequentially, with the recently arrived elements being more important. A common approach for handling large data sets is to maintain a coreset, a succinct summary of the processed data that allows approximate recovery of a predetermined query. We provide a general framework that takes any offline coreset construction and gives a time-decay coreset for polynomial time-decay functions. We also consider the exponential time-decay model for k-median clustering, where we provide a constant-factor approximation algorithm that utilizes the online facility location algorithm. Our algorithm stores O(k log(hΔ) + h) points, where h is the half-life of the decay function and Δ is the aspect ratio of the dataset. Our techniques extend to k-means clustering and M-estimators as well.
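The exponential time-decay model referred to here weights an element by its age relative to the half-life h. A small sketch of the weighting, together with an illustrative decayed 1-D k-median cost (the cost function below is an assumption for illustration, not the paper's coreset construction):

```python
def decay_weight(age, half_life):
    """Exponential time-decay weight: an element `age` arrivals old counts
    with weight 2 ** (-age / half_life), so its influence halves once per
    half-life."""
    return 2.0 ** (-age / half_life)

def decayed_cost(points, ages, centers, half_life):
    """Illustrative weighted 1-D k-median cost: each point's distance to
    its nearest center is scaled by its time-decay weight, so stale points
    contribute less to the objective."""
    return sum(decay_weight(a, half_life) * min(abs(p - c) for c in centers)
               for p, a in zip(points, ages))

w_now = decay_weight(0, 8)    # a fresh element has full weight 1.0
w_old = decay_weight(8, 8)    # one half-life old: weight 0.5
cost = decayed_cost([0.0, 10.0], [0, 8], [1.0], 8)
```

In this toy example the fresh point at 0.0 pays its full distance 1 to the single center, while the older point at 10.0 pays only half of its distance 9, giving a total decayed cost of 5.5.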
    • 
