
    Superclustering by finding statistically significant separable groups of optimal gaussian clusters

    Full text link
    The paper presents an algorithm for clustering a dataset by grouping an optimal (in the sense of the BIC criterion) number of Gaussian clusters into an optimal (in the sense of their statistical separability) number of superclusters. The algorithm consists of three stages: representing the dataset as a mixture of Gaussian distributions (clusters), whose number is determined by the minimum of the BIC criterion; using the Mahalanobis distance to estimate the distances between clusters and the cluster sizes; and combining the resulting clusters into superclusters with the DBSCAN method, choosing its hyperparameter (maximum distance) so that the introduced matrix quality criterion is maximized at the maximum number of superclusters. The matrix quality criterion corresponds to the proportion of statistically significantly separated superclusters among all superclusters found. The algorithm has only one hyperparameter, the statistical significance level, and automatically detects the optimal number and shape of superclusters based on a statistical hypothesis testing approach. The algorithm demonstrates good results on test datasets both with and without noise. An essential advantage of the algorithm is its ability to predict the correct supercluster for new data from an already trained clusterer and to perform soft (fuzzy) clustering. Its disadvantages are low speed and the stochastic nature of the final clustering. It requires a sufficiently large dataset for clustering, which is typical of many statistical methods. Comment: 32 pages, 7 figures, 1 table
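
    A minimal sketch of the three-stage pipeline in Python (scikit-learn), assuming a symmetrized Mahalanobis distance between component means and a fixed distance quantile in place of the paper's statistical-significance quality criterion; it illustrates the idea, not the authors' implementation.

    ```python
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.mixture import GaussianMixture

    X, _ = make_blobs(n_samples=1000, centers=6, random_state=0)

    # Stage 1: number of Gaussian clusters = argmin of the BIC criterion.
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, 15)]
    gmm = min(fits, key=lambda m: m.bic(X))
    K = gmm.n_components

    # Stage 2: pairwise Mahalanobis distances between component means,
    # symmetrized (an assumption; the paper also folds in cluster sizes).
    D = np.zeros((K, K))
    for i in range(K):
        prec = np.linalg.inv(gmm.covariances_[i])
        for j in range(K):
            d = gmm.means_[i] - gmm.means_[j]
            D[i, j] = np.sqrt(d @ prec @ d)
    D = 0.5 * (D + D.T)

    # Stage 3: merge clusters into superclusters with DBSCAN on the
    # distance matrix. The paper scans eps to maximize its matrix quality
    # criterion; a fixed quantile stands in for that scan here.
    eps = np.quantile(D[D > 0], 0.25)
    superclusters = DBSCAN(eps=eps, min_samples=1,
                           metric="precomputed").fit_predict(D)

    # New points inherit the supercluster of their most likely component.
    print("supercluster per component:", superclusters)
    print("first 5 points:", superclusters[gmm.predict(X[:5])])
    ```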

    Discovering Communities of Community Discovery

    Get PDF
    Discovering communities in complex networks means grouping nodes similar to each other, to uncover latent information about them. There are hundreds of different algorithms to solve the community detection task, each with its own understanding and definition of what a "community" is. Dozens of review works attempt to order such a diverse landscape -- classifying community discovery algorithms by the process they employ to detect communities, by their explicitly stated definition of community, or by their performance on a standardized task. In this paper, we classify community discovery algorithms according to a fourth criterion: the similarity of their results. We create an Algorithm Similarity Network (ASN), whose nodes are the community detection approaches, connected if they return similar groupings. We then perform community detection on this network, grouping algorithms that consistently return the same partitions or overlapping coverage over a span of more than one thousand synthetic and real world networks. This paper is an attempt to create a similarity-based classification of community detection algorithms based on empirical data. It improves over the state of the art by comparing more than seventy approaches, discovering that the ASN contains well-separated groups, making it a sensible tool for practitioners, aiding their choice of algorithms fitting their analytic needs
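
    The ASN construction can be sketched as follows, assuming NMI as the partition-similarity score and an illustrative 0.8 linking threshold; the paper's own similarity measure, algorithm roster, and thousand-plus networks are far broader.

    ```python
    from itertools import combinations
    import networkx as nx
    from sklearn.metrics import normalized_mutual_info_score

    def labels_from(communities, n):
        """Flatten a list of node sets into a per-node label vector."""
        lab = [0] * n
        for c, nodes in enumerate(communities):
            for v in nodes:
                lab[v] = c
        return lab

    algorithms = {
        "greedy_modularity": nx.community.greedy_modularity_communities,
        "label_propagation": nx.community.label_propagation_communities,
        "louvain": nx.community.louvain_communities,
    }

    # A handful of synthetic benchmarks stands in for the paper's corpus.
    graphs = [nx.planted_partition_graph(4, 25, 0.5, 0.05, seed=s)
              for s in range(5)]

    asn = nx.Graph()
    asn.add_nodes_from(algorithms)
    for a, b in combinations(algorithms, 2):
        sims = [normalized_mutual_info_score(
                    labels_from(algorithms[a](G), len(G)),
                    labels_from(algorithms[b](G), len(G)))
                for G in graphs]
        score = sum(sims) / len(sims)
        if score > 0.8:  # linking threshold is an assumption
            asn.add_edge(a, b, weight=score)

    # Community detection on the ASN itself would then group algorithms
    # that consistently return the same partitions.
    print(sorted(asn.edges(data="weight")))
    ```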

    A generalization of periodic autoregressive models for seasonal time series

    Get PDF
    Many nonstationary time series exhibit changes in trend and seasonal structure that may be modeled by splitting the time axis into different regimes. We propose multi-regime models where, inside each regime, the trend is linear and seasonality is explained by a Periodic Autoregressive model. In addition, to achieve parsimony, we allow season grouping, i.e. seasons may consist of one, two, or more consecutive observations. Since the set of possible solutions is very large, the number of regimes, the change times, and the order and structure of the Autoregressive models are chosen by means of a Genetic Algorithm, and each candidate solution is evaluated with an identification criterion such as AIC, BIC or MDL; a sketch of such an objective follows below. The comparison and performance of the proposed method are illustrated by a real data analysis. The results suggest that the proposed procedure is useful for analyzing complex phenomena with structural breaks, changes in trend and evolving seasonality
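
    A sketch of the objective such a Genetic Algorithm would optimize, assuming a simplified per-regime model (linear trend plus a single lag-12 periodic term) and BIC as the identification criterion; the function name and toy series are invented for illustration.

    ```python
    import numpy as np

    def regime_bic(y, breaks, period=12):
        """BIC of a piecewise trend + seasonal-lag model over given breaks."""
        bounds = [0] + sorted(breaks) + [len(y)]
        rss, n_params, n_obs = 0.0, 0, 0
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            seg = y[lo:hi]
            t = np.arange(len(seg), dtype=float)
            # Regressors: intercept, linear trend, value one period back.
            X = np.column_stack([np.ones(len(seg) - period),
                                 t[period:],
                                 seg[:-period]])
            target = seg[period:]
            beta, *_ = np.linalg.lstsq(X, target, rcond=None)
            rss += np.sum((target - X @ beta) ** 2)
            n_params += X.shape[1]
            n_obs += len(target)
        return n_obs * np.log(rss / n_obs) + n_params * np.log(n_obs)

    # Toy monthly series with a level shift at t = 120.
    rng = np.random.default_rng(0)
    t = np.arange(240)
    y = 0.05 * t + np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, 240)
    y[120:] += 5.0

    # The GA would mutate/recombine candidate break sets to minimize this.
    print(regime_bic(y, breaks=[120]), "vs", regime_bic(y, breaks=[]))
    ```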

    A web-based new student admission system at SMKN 3 Pati for the achievement track using the K-Means clustering algorithm

    Get PDF
    The new student admission system uses K-Means data grouping, the simplest clustering approach compared to other algorithms. K-Means is a data mining method that groups data into several clusters whose members are similar to each other while separating clusters by the differences between them. The K-Means clustering algorithm aims to minimize an objective function set during the clustering process. Implementing the K-Means algorithm in a clustering information system yields an effective classification of the data: each iteration recomputes the distances to the centroids and the resulting cluster assignments, and using student data as the reference objects saves time when clustering the superior class. The web-based clustering information system makes the results more flexible, accessible at any time by users granted access rights to the data. To obtain the superior class classification, the information system applies K-Means to form three clusters for each class, namely M1, M2 and M3: M1 for high scores with criterion values of 85 to 100, M2 for medium scores with criterion values of 75 to 80, and M3 for low scores with criterion values of 10 to 70.
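
    A minimal sketch of that grouping step with scikit-learn, assuming made-up applicant scores; clusters are renamed M1/M2/M3 by descending centroid so the labels match the score bands above.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical achievement-track scores, one per applicant.
    scores = np.array([92, 88, 78, 76, 95, 60, 45, 70, 86, 79, 30, 98],
                      dtype=float).reshape(-1, 1)

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)

    # Rank clusters by centroid so M1 is always the highest-scoring group.
    order = np.argsort(-km.cluster_centers_.ravel())
    names = {cluster: f"M{rank + 1}" for rank, cluster in enumerate(order)}
    for s, c in zip(scores.ravel(), km.labels_):
        print(s, "->", names[c])
    ```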

    Grouping Objects to Homogeneous Classes Satisfying Requisite Mass

    Get PDF
    Grouping datasets plays an important role in many fields of scientific research. Depending on the data features and the application, different constraints are imposed on the groups, while having groups with similar members is always a main criterion. In this paper, we propose an algorithm for grouping objects with random labels and nominal features that have many nominal attributes. In addition, a size constraint on the groups is necessary. These conditions lead to a mixed integer optimization problem that is neither convex nor linear. It is an NP-hard problem, and exact solution methods are computationally costly. Our motivation for solving such a problem comes from grouping insurance data, which is essential for fair pricing. The proposed algorithm consists of two phases. First, we rank the random labels using fuzzy numbers. Then, an adjusted K-means algorithm is used to produce homogeneous groups satisfying a cluster size constraint. Fuzzy numbers are used to compare random labels in both their observed values and their chance of occurrence. Moreover, an index is defined to measure the similarity of multi-valued attributes without perfect information to those accompanied by perfect information. Since all ranks are scaled into the interval [0,1], the ranking of random labels requires no rescaling techniques. In the adjusted K-means algorithm, the optimal number of clusters is found using the coefficient of variation instead of the Euclidean distance. Experiments demonstrate that our proposed algorithm produces fairly homogeneous and significantly different groups having requisite mass
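
    One ingredient above, choosing the number of clusters by the coefficient of variation rather than a Euclidean criterion, might be sketched as follows; the paper does not specify its exact CV statistic, so the averaging rule below is an assumption.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8,
                      random_state=1)

    def mean_within_cv(X, labels):
        """Average coefficient of variation of distances to each centroid."""
        cvs = []
        for c in np.unique(labels):
            pts = X[labels == c]
            d = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
            if d.mean() > 0:
                cvs.append(d.std() / d.mean())
        return float(np.mean(cvs))

    # Pick the k whose clusters have the lowest average within-cluster CV.
    scores = {}
    for k in range(2, 9):
        labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
        scores[k] = mean_within_cv(X, labels)

    print("chosen k:", min(scores, key=scores.get))
    ```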

    A modified parallel tree code for N-body simulation of the Large Scale Structure of the Universe

    Full text link
    N-body codes for simulating the origin and evolution of the Large Scale Structure of the Universe have improved significantly over the past decade, both in the resolution achieved and in the reduction of CPU time. However, state-of-the-art N-body codes hardly allow one to deal with particle numbers larger than a few 10^7, even on the largest parallel systems. In order to allow simulations at higher resolution, we have first re-considered the grouping strategy described in Barnes (1990) (hereafter B90) and applied it, with some modifications, to our WDSH-PT (Work and Data SHaring - Parallel Tree) code. In the first part of this paper we give a short description of the code, which adopts the Barnes and Hut (1986) algorithm (hereafter BH), and in particular of the memory and work distribution strategy used to describe the data distribution on a CC-NUMA machine like the CRAY-T3E system. In the second part of the paper we describe the modification to the Barnes grouping strategy that we have devised to improve the performance of the WDSH-PT code. We use the property that nearby particles have similar interaction lists. This idea was checked in B90, where an interaction list is built that applies everywhere within a cell C_{group} containing a small number of particles N_{crit}; B90 reuses this interaction list for each particle p ∈ C_{group} in the cell in turn. We assume each particle p to have the same interaction list. This has made it possible to reduce the CPU time and increase performance, allowing simulations with a large number of particles (N ~ 10^7 - 10^9) in non-prohibitive times. Comment: 13 pages and 7 figures
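
    A toy two-dimensional sketch of the grouping idea (not the WDSH-PT code): particles are binned into grid cells, one interaction list is built per group cell, with near cells kept particle-by-particle and far cells collapsed to their centres of mass, and that single list is reused for every particle in the group. The opening criterion and softening are purely illustrative, and intra-group pair forces are omitted for brevity.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    pos = rng.uniform(0, 8, size=(400, 2))   # particle positions, 8x8 box
    mass = np.ones(400)
    cell = np.floor(pos).astype(int)         # unit grid cells play C_group

    def accel(targets, src_pos, src_mass, soft=1e-2):
        """Softened attraction of target particles toward source points."""
        d = src_pos[None, :, :] - targets[:, None, :]
        r2 = (d ** 2).sum(-1) + soft
        return (src_mass[None, :, None] * d / r2[..., None] ** 1.5).sum(axis=1)

    acc = np.zeros_like(pos)
    for cx in range(8):
        for cy in range(8):
            grp = (cell[:, 0] == cx) & (cell[:, 1] == cy)
            if not grp.any():
                continue
            near = (np.abs(cell[:, 0] - cx) <= 1) & (np.abs(cell[:, 1] - cy) <= 1)
            # One interaction list for the whole group: near cells stay as
            # particles, every far cell collapses to its centre of mass.
            src_p = [pos[near & ~grp]]
            src_m = [mass[near & ~grp]]
            for fx in range(8):
                for fy in range(8):
                    sel = ~near & (cell[:, 0] == fx) & (cell[:, 1] == fy)
                    if sel.any():
                        m = mass[sel].sum()
                        com = (mass[sel][:, None] * pos[sel]).sum(0, keepdims=True) / m
                        src_p.append(com)
                        src_m.append([m])
            # The same list serves every particle p in C_group.
            acc[grp] = accel(pos[grp], np.concatenate(src_p), np.concatenate(src_m))

    print("mean |acc|:", np.linalg.norm(acc, axis=1).mean())
    ```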

    The composite absolute penalties family for grouped and hierarchical variable selection

    Full text link
    Extracting useful information from high-dimensional data is an important focus of today's statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. With the virtues of both regularization and sparsity, the L_1-penalized squared error minimization method Lasso has been popular in regression models and beyond. In this paper, we combine different norms including L_1 to form an intelligent penalty in order to add side information to the fitting of a regression or classification model to obtain reasonable estimates. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups and combining the properties of norm penalties at the across-group and within-group levels. Grouped selection occurs for nonoverlapping groups. Hierarchical variable selection is reached by defining groups with particular overlapping patterns. We propose using the BLASSO and cross-validation to compute CAP estimates in general. For a subfamily of CAP estimates involving only the L_1 and L_∞ norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived so that the regularization parameter is selected without cross-validation. CAP is shown to improve on the predictive performance of the LASSO in a series of simulated experiments, including cases with p ≫ n and possibly mis-specified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments. Comment: Published in at http://dx.doi.org/10.1214/07-AOS584 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
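
    The L_1/L_∞ subfamily's penalty is easy to state concretely: an outer L_1 sum across groups of an inner L_∞ norm within each group, so a group enters or leaves the model as a unit. The sketch below evaluates only the penalty under assumed toy coefficients; the BLASSO and iCAP path machinery are not reproduced.

    ```python
    import numpy as np

    def cap_penalty(beta, groups, inner=np.inf):
        """Outer L1 sum across groups of the inner norm within each group."""
        return sum(np.linalg.norm(beta[g], ord=inner) for g in groups)

    beta = np.array([0.0, 0.0, 0.0, 1.5, -2.0, 0.3])
    groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]

    # The first group is exactly zero and contributes nothing: the penalty
    # acts like a lasso at the group level.
    print(cap_penalty(beta, groups))           # L1 across, L-inf within
    print(cap_penalty(beta, groups, inner=2))  # group-lasso-style variant
    ```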