Superclustering by finding statistically significant separable groups of optimal gaussian clusters
The paper presents an algorithm that clusters a dataset by grouping a number of Gaussian clusters, optimal with respect to the BIC criterion, into superclusters that are optimal with respect to their statistical separability.
The algorithm consists of three stages: representing the dataset as a mixture of Gaussian distributions (clusters), whose number is determined by the minimum of the BIC criterion; using the Mahalanobis distance to estimate the distances between clusters and the cluster sizes; and combining the resulting clusters into superclusters with the DBSCAN method, by finding the value of its hyperparameter (the maximum distance) that maximizes an introduced matrix quality criterion at the maximum number of superclusters. The matrix quality criterion is the proportion of statistically significantly separated superclusters among all superclusters found.
The algorithm has only one hyperparameter, the statistical significance level, and automatically detects the optimal number and shape of superclusters based on statistical hypothesis testing. It demonstrates good results on test datasets in both noisy and noiseless settings. An essential advantage of the algorithm is its ability to predict the correct supercluster for new data using an already trained clusterer, and to perform soft (fuzzy) clustering. Its disadvantages are its low speed and the stochastic nature of the final clustering; it also requires a sufficiently large dataset, as is typical of many statistical methods.
Comment: 32 pages, 7 figures, 1 table
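The third stage can be sketched in a minimal, hypothetical form. All data below are made up, the Mahalanobis distance is simplified to a symmetrised 1-D version, and eps is fixed rather than searched for as in the paper; the key point is that DBSCAN with min_samples = 1 on a precomputed distance matrix reduces to connected components under the eps threshold:

```python
# Sketch: merge Gaussian clusters into superclusters by linking any pair
# closer than eps (equivalent to DBSCAN with min_samples=1 on a
# precomputed distance matrix).  Hypothetical 1-D clusters.
import math

def mahalanobis_1d(m1, v1, m2, v2):
    # symmetrised 1-D Mahalanobis distance between two Gaussians
    return abs(m1 - m2) / math.sqrt((v1 + v2) / 2.0)

def superclusters(means, variances, eps):
    n = len(means)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # union-find: link clusters whose distance is below eps
    for i in range(n):
        for j in range(i + 1, n):
            if mahalanobis_1d(means[i], variances[i],
                              means[j], variances[j]) < eps:
                parent[find(i)] = find(j)
    # relabel connected components 0..k-1 in order of first appearance
    labels, seen = [], {}
    for i in range(n):
        labels.append(seen.setdefault(find(i), len(seen)))
    return labels

# three clusters: two close together, one far away
means, variances = [0.0, 1.0, 10.0], [1.0, 1.0, 1.0]
print(superclusters(means, variances, eps=2.0))  # → [0, 0, 1]
```

The paper then sweeps eps and keeps the value that maximizes the matrix quality criterion; that search loop is omitted here.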
Discovering Communities of Community Discovery
Discovering communities in complex networks means grouping nodes similar to
each other, to uncover latent information about them. There are hundreds of
different algorithms to solve the community detection task, each with its own
understanding and definition of what a "community" is. Dozens of review works
attempt to order such a diverse landscape -- classifying community discovery
algorithms by the process they employ to detect communities, by their
explicitly stated definition of community, or by their performance on a
standardized task. In this paper, we classify community discovery algorithms
according to a fourth criterion: the similarity of their results. We create an
Algorithm Similarity Network (ASN), whose nodes are the community detection
approaches, connected if they return similar groupings. We then perform
community detection on this network, grouping algorithms that consistently
return the same partitions or overlapping coverage over a span of more than one
thousand synthetic and real world networks. This paper is an attempt to create
a similarity-based classification of community detection algorithms based on
empirical data. It improves over the state of the art by comparing more than
seventy approaches, discovering that the ASN contains well-separated groups,
making it a sensible tool for practitioners, aiding their choice of algorithms fitting their analytic needs.
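The ASN construction can be sketched as follows. The partitions, the Rand index as the similarity measure, and the threshold are all assumptions for illustration; the paper's actual similarity measure and edge criterion may differ:

```python
# Sketch: build an Algorithm Similarity Network (ASN) by comparing the
# partitions that different "algorithms" return on the same nodes, and
# linking two algorithms when their similarity exceeds a threshold.
from itertools import combinations

def rand_index(p, q):
    # fraction of node pairs on which two partitions agree
    # (both together or both apart)
    agree = total = 0
    for i, j in combinations(range(len(p)), 2):
        total += 1
        agree += (p[i] == p[j]) == (q[i] == q[j])
    return agree / total

partitions = {                       # hypothetical outputs
    "algo_a": [0, 0, 1, 1, 2, 2],
    "algo_b": [0, 0, 1, 1, 2, 2],    # identical grouping
    "algo_c": [0, 1, 0, 1, 0, 1],    # very different grouping
}

threshold = 0.9
asn_edges = [
    (a, b)
    for a, b in combinations(partitions, 2)
    if rand_index(partitions[a], partitions[b]) >= threshold
]
print(asn_edges)  # → [('algo_a', 'algo_b')]
```

Community detection is then run on this network of algorithms, exactly as one would on any other graph.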
A generalization of periodic autoregressive models for seasonal time series
Many nonstationary time series exhibit changes in trend and seasonality structure that may be modeled by splitting the time axis into different regimes. We propose multi-regime models where, inside each regime, the trend is linear and seasonality is explained by a Periodic Autoregressive model. In addition, to achieve parsimony, we allow season grouping, i.e. a season may consist of one, two, or more consecutive observations. Since the set of possible solutions is very large, the number of regimes, the change times, and the order and structure of the Autoregressive models are chosen by means of a Genetic Algorithm, and the evaluation of each candidate solution is left to an identification criterion such as AIC, BIC or MDL. The comparison and performance of the proposed method are illustrated by a real data analysis. The results suggest that the proposed procedure is useful for analyzing complex phenomena with structural breaks, changes in trend and evolving seasonality.
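The evaluation step can be sketched on toy data. Here a candidate solution is reduced to just a set of change times, an ordinary linear trend per regime stands in for the full trend-plus-Periodic-AR fit, and the data are made up; only the BIC-scoring idea carries over from the paper:

```python
# Sketch: score a candidate set of change times with BIC, fitting an
# OLS linear trend inside each regime (a stand-in for the paper's full
# trend + Periodic AR model).
import math

def linear_sse(y):
    # residual sum of squares of an OLS line fit to y against t = 0..n-1
    n = len(y)
    t = list(range(n))
    tm, ym = sum(t) / n, sum(y) / n
    sxx = sum((ti - tm) ** 2 for ti in t)
    sxy = sum((ti - tm) * (yi - ym) for ti, yi in zip(t, y))
    b = sxy / sxx
    a = ym - b * tm
    return sum((yi - (a + b * ti)) ** 2 for ti, yi in zip(t, y))

def bic_of_split(y, change_times):
    # change times partition y into regimes; 2 parameters per regime
    # (slope, intercept) plus a common noise variance
    bounds = [0] + list(change_times) + [len(y)]
    sse = sum(linear_sse(y[s:e]) for s, e in zip(bounds, bounds[1:]))
    n, k = len(y), 2 * (len(bounds) - 1) + 1
    return n * math.log(sse / n + 1e-12) + k * math.log(n)

# piecewise trend: flat for 10 steps, then steeply rising
y = [0.0] * 10 + [float(i) for i in range(10)]
print(bic_of_split(y, [10]) < bic_of_split(y, []))  # → True
```

In the paper, a Genetic Algorithm searches over such candidate solutions (regimes, change times, AR structure), using the criterion value as the fitness.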
Web-Based New Student Admission System at SMKN 3 Pati via the Achievement Track Using the K-Means Clustering Algorithm
The new student admission system uses K-Means data grouping, the simplest clustering approach compared to other algorithms. K-Means is a data-mining algorithm that groups data into clusters whose members are similar and separates the clusters from one another based on their differences. The study of the K-Means clustering algorithm aims to minimize the objective function set during the clustering process. Implementing the K-Means algorithm in a clustering information system yields an effective classification of the data: in each iteration the distances to the centroids are recomputed and the cluster points are determined, and using student data as the reference object saves time in clustering the superior class. This web-based clustering information system makes the resulting information more flexible, accessible at any time by users who are granted access rights to the data. Applying the K-Means clustering algorithm to obtain the superior-class classification requires an information system that forms 3 clusters for each class, namely M1, M2 and M3: M1 for high scores with criterion values from 85 to 100, M2 for medium scores with criterion values from 75 to 80, and M3 for low scores with criterion values from 10 to 70.
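The clustering step can be sketched with Lloyd's K-Means on one-dimensional scores. The scores and initial centroids below are made up; only the three-cluster M1/M2/M3 structure comes from the paper:

```python
# Sketch: Lloyd's K-Means on 1-D achievement scores, producing three
# clusters analogous to the paper's M1 (high), M2 (medium), M3 (low).
def kmeans_1d(scores, centroids, iters=20):
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for s in scores:
            # assign each score to its nearest centroid
            i = min(range(len(centroids)), key=lambda i: abs(s - centroids[i]))
            clusters[i].append(s)
        # recompute each centroid as its cluster mean
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

scores = [95, 88, 92, 78, 76, 80, 40, 55, 62]     # hypothetical data
centroids, clusters = kmeans_1d(scores, centroids=[90.0, 75.0, 50.0])
print(clusters)  # → [[95, 88, 92], [78, 76, 80], [40, 55, 62]]
```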
Grouping Objects to Homogeneous Classes Satisfying Requisite Mass
Grouping datasets plays an important role in many scientific fields. Depending on the data features and the application, different constraints are imposed on the groups, while having groups with similar members is always a main criterion. In this paper, we propose an algorithm for grouping objects with random labels and nominal features that have many nominal attributes. In addition, a size constraint on the groups is necessary. These conditions lead to a mixed-integer optimization problem that is neither convex nor linear; it is NP-hard, and exact solution methods are computationally costly. Our motivation for solving such a problem comes from grouping insurance data, which is essential for fair pricing. The proposed algorithm has two phases. First, we rank the random labels using fuzzy numbers. Then, an adjusted K-means algorithm is used to produce homogeneous groups satisfying the cluster size constraint. Fuzzy numbers are used to compare random labels in both their observed values and their chances of occurrence. Moreover, an index is defined to measure the similarity of multi-valued attributes without perfect information to those accompanied by perfect information. Since all ranks are scaled into the interval [0,1], the result of ranking random labels does not need rescaling techniques. In the adjusted K-means algorithm, the optimal number of clusters is found using the coefficient of variation instead of the Euclidean distance. Experiments demonstrate that our proposed algorithm produces fairly homogeneous and significantly different groups of the requisite mass.
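One ingredient can be illustrated in isolation: scoring a candidate grouping by the within-group coefficient of variation (CV) rather than Euclidean distortion. The data and the equal-size contiguous cut are assumptions made so the size constraint holds by construction; the paper's adjusted K-means is considerably more involved:

```python
# Sketch: compare candidate numbers of clusters k by the mean within-
# group coefficient of variation; sorted values are cut into k groups
# of equal size, so the size constraint is satisfied trivially.
import statistics

def mean_within_cv(values, k):
    vals = sorted(values)
    size = len(vals) // k
    groups = [vals[i * size:(i + 1) * size] for i in range(k)]
    cvs = [statistics.pstdev(g) / statistics.mean(g) for g in groups]
    return sum(cvs) / k

# hypothetical data with three natural bands
values = [1.0, 1.1, 1.2, 5.0, 5.2, 5.1, 9.8, 10.0, 10.1]
best_k = min((2, 3), key=lambda k: mean_within_cv(values, k))
print(best_k)  # → 3
```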
A modified parallel tree code for N-body simulation of the Large Scale Structure of the Universe
N-body codes for simulating the origin and evolution of the Large
Scale Structure of the Universe have improved significantly over the past
decade, both in the resolution achieved and in the reduction of CPU
time. However, state-of-the-art N-body codes hardly allow one to deal with
particle numbers larger than a few 10^7, even on the largest parallel systems.
To allow simulations with larger resolution, we have first
re-considered the grouping strategy described in Barnes (1990) (hereafter
B90) and applied it, with some modifications, to our WDSH-PT (Work and Data
SHaring - Parallel Tree) code. In the first part of this paper we give a
short description of the code, which adopts the Barnes and Hut algorithm
\cite{barh86} (hereafter BH), and in particular of the memory and work
distribution strategy used to describe the {\it data distribution} on a
CC-NUMA machine like the CRAY-T3E system. In the second part of the paper we
describe the modification to the Barnes grouping strategy that we have devised
to improve the performance of the WDSH-PT code. We exploit the property that
nearby particles have similar interaction lists. This idea was checked in
B90, where an interaction list is built that applies everywhere within a
cell C_{group} containing a small number of particles N_{crit}, and this
interaction list is reused for each particle in the cell in turn.
We instead assume that every particle p in the cell has the same interaction list.
This has made it possible to reduce the CPU time and increase performance,
allowing us to run simulations with a large number of particles (N ~
10^7 - 10^9) in non-prohibitive times.
Comment: 13 pages and 7 figures
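The grouping idea can be illustrated in a drastically simplified 1-D setting. The data, the opening criterion and the softening are all assumptions for illustration; the real code is a parallel 3-D Barnes-Hut tree:

```python
# Sketch: build ONE interaction list for a whole group cell (cells well
# separated from the entire group region under an opening angle theta)
# and reuse it for every particle in the group.  Hypothetical 1-D data.
def shared_interaction_list(group_center, group_radius, cells, theta=0.5):
    # a cell is accepted for the whole group only if it is well
    # separated from the entire group region, not just one particle
    accepted = []
    for center, size, mass in cells:
        d = abs(center - group_center) - group_radius
        if d > 0 and size / d < theta:
            accepted.append((center, mass))
    return accepted

def accel(p, interactions, eps=1e-3):
    # softened 1-D gravitational acceleration from the shared list
    return sum(m * (c - p) / (abs(c - p) ** 3 + eps)
               for c, m in interactions)

group = [0.0, 0.1, 0.2]                       # particles in one group cell
cells = [(10.0, 1.0, 5.0), (0.5, 1.0, 1.0)]   # (center, size, mass)
ilist = shared_interaction_list(0.1, 0.1, cells)
print(ilist)  # → [(10.0, 5.0)]  (the nearby cell must be opened instead)
```

Building the list once per group rather than once per particle is what saves CPU time, at the cost of a slightly more conservative acceptance test.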
The composite absolute penalties family for grouped and hierarchical variable selection
Extracting useful information from high-dimensional data is an important
focus of today's statistical research and practice. Penalized loss function
minimization has been shown to be effective for this task both theoretically
and empirically. With the virtues of both regularization and sparsity, the
$L_1$-penalized squared error minimization method Lasso has been popular in
regression models and beyond. In this paper, we combine different norms,
including $L_1$, to form an intelligent penalty in order to add side information
to the fitting of a regression or classification model and obtain reasonable
estimates. Specifically, we introduce the Composite Absolute Penalties (CAP)
family, which allows given grouping and hierarchical relationships between the
predictors to be expressed. CAP penalties are built by defining groups and
combining the properties of norm penalties at the across-group and within-group
levels. Grouped selection occurs for nonoverlapping groups. Hierarchical
variable selection is reached by defining groups with particular overlapping
patterns. We propose using the BLASSO and cross-validation to compute CAP
estimates in general. For a subfamily of CAP estimates involving only the $L_1$
and $L_\infty$ norms, we introduce the iCAP algorithm to trace the entire
regularization path for the grouped selection problem. Within this subfamily,
unbiased estimates of the degrees of freedom (df) are derived so that the
regularization parameter is selected without cross-validation. CAP is shown to
improve on the predictive performance of the LASSO in a series of simulated
experiments, including cases with $p \gg n$ and possibly mis-specified
groupings. When the complexity of a model is properly calculated, iCAP is seen
to be parsimonious in the experiments.
Comment: Published at http://dx.doi.org/10.1214/07-AOS584 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
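One common parameterisation of such a penalty (an L1 norm across groups of within-group L-infinity norms, which yields grouped selection for non-overlapping groups) can be sketched directly; the coefficients and groups below are made up, and the paper's own notation may differ:

```python
# Sketch of a CAP-style penalty: sum over groups of the max absolute
# coefficient in each group (L1 across groups, L-infinity within).
def cap_penalty(beta, groups):
    return sum(max(abs(beta[j]) for j in g) for g in groups)

beta = [0.0, 0.0, 0.5, -2.0]        # hypothetical coefficients
groups = [[0, 1], [2, 3]]           # non-overlapping: grouped selection
print(cap_penalty(beta, groups))    # → 2.0  (0.0 from group 1, 2.0 from group 2)
```

Because the within-group L-infinity norm is paid only once per group, shrinking one coefficient in an already-active group is free at the margin, which is what drives whole groups in or out of the model together.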