Summary statistics for the three microarray experimental datasets used to test our algorithm and the other three variants of the k-means algorithm.
<p>The second and third columns give the total number of genes covered in each experiment and the number of equally spaced time points at which the genes' transcriptional expression is measured.</p>
Hubert-Arabie Adjusted Rand Index (ARI<sub>HA</sub>) cluster quality results for biological and non-biological data.
<p>For each dataset, Bozdech et al. 3D7 and HB3 strains <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0049946#pone.0049946-Bozdech1" target="_blank">[26]</a> and Le Roch et al. <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0049946#pone.0049946-LeRoch1" target="_blank">[27]</a>, we used two values of k to demonstrate the effect of changing k on the cluster quality of the clustering algorithms. We treated the structure produced by the traditional k-means as the known structure and compared the clusters of the MM, Enhanced, and Overlapped k-means with it. In a separate (last) column, we also compare the structure of the Enhanced k-means with that of the Overlapped k-means.</p>
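The Hubert-Arabie adjusted Rand index used in this comparison can be computed directly from the contingency table of two labelings. A minimal sketch (function and variable names are ours, not from the paper):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two flat clusterings."""
    n = len(labels_a)
    # Contingency counts: how many points fall in each (cluster-in-A, cluster-in-B) pair.
    contingency = Counter(zip(labels_a, labels_b))
    a_sizes = Counter(labels_a)   # row sums
    b_sizes = Counter(labels_b)   # column sums

    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a_sizes.values())
    sum_b = sum(comb(c, 2) for c in b_sizes.values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:     # degenerate case, e.g. both single-cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions (up to relabeling) score 1.0; chance-level agreement scores near 0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

The index is invariant to cluster relabeling, which is why it suits comparing one algorithm's structure against another's, as done in the table.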
Non-biological data used for testing our algorithm and the other three variants of the k-means algorithm.
<p>The Abalone dataset, described by 8 attributes, represents physical measurements of abalone (a sea organism). The Wind dataset, described by 12 attributes, represents wind measurements from 1/1/1961 to 31/12/1978. The Letter dataset represents images of English capital letters described by 16 primitive numerical attributes (statistical moments and edge counts).</p>
Performance comparison of all four k-means variants considered, on very large data sets.
<p>This constitutes a simulation of three large data sets of dimensions 10,000×50, 30,000×50, and 50,000×50. The range of k used is 10 ≤ k ≤ 40 for all four algorithms.</p>
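Synthetic inputs of the sizes above are straightforward to generate for a benchmark harness. A sketch (the dimensions come from the caption; the uniform distribution and fixed seed are our assumptions, since the caption does not specify how the data were simulated):

```python
import random

def make_dataset(n_rows, n_cols, seed=0):
    """Generate an n_rows x n_cols matrix of uniform random features in [0, 1)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]

# The three benchmark sizes from the caption, each with 50 features.
shapes = [(10_000, 50), (30_000, 50), (50_000, 50)]
datasets = [make_dataset(r, c) for r, c in shapes]
print([(len(d), len(d[0])) for d in datasets])  # → [(10000, 50), (30000, 50), (50000, 50)]
```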
Execution Time (Bozdech <i>et al.</i>, <i>P.f</i> 3D7 Microarray Dataset).
<p>The plot shows that our MMk-means has the fastest run-time across the tested numbers of clusters, 15 ≤ k ≤ 25. Comparatively, k = 20 took the longest run-time for all four algorithms, implying that this is a function of the nature of the data under consideration.</p>
Pseudocode of our Compute_MM Sub-program for <i>MMk-means</i>.
<p>We create a covariance matrix by computing the Pearson product-moment correlation coefficient between the k centroids of the previous and current iterations, and then derive the k eigenvalues for the previous and current iterations. The difference between these eigenvalues for each cluster is computed and checked to see whether it satisfies the <i>Ding-He</i> interval.</p>
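The per-cluster stability check described above can be sketched as follows. This is a loose interpretation: the use of `numpy.corrcoef` over the centroid rows, the eigenvalue comparison, and the fixed tolerance standing in for the Ding-He interval are our assumptions, since the caption does not give the exact bound:

```python
import numpy as np

def compute_mm(prev_centroids, curr_centroids, tol=1e-3):
    """Flag clusters whose centroid geometry has stopped changing.

    Builds the Pearson correlation matrix over the previous and over the
    current centroid sets, compares their eigenvalues per cluster, and
    marks a cluster stable when the difference falls inside a tolerance
    band (a stand-in here for the Ding-He interval).
    """
    # Correlation (normalized covariance) matrices over the k centroid rows.
    prev_corr = np.corrcoef(prev_centroids)
    curr_corr = np.corrcoef(curr_centroids)
    # Eigenvalues of the symmetric matrices, sorted ascending by eigvalsh.
    prev_eigs = np.linalg.eigvalsh(prev_corr)
    curr_eigs = np.linalg.eigvalsh(curr_corr)
    # One boolean stability flag per cluster.
    return np.abs(curr_eigs - prev_eigs) <= tol

# Unchanged centroids: every cluster is reported stable.
c = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 2.0, 0.0]])
print(compute_mm(c, c).all())  # → True
```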
Pseudocode of our main program for <i>MMk-means</i>.
<p>It runs like the traditional k-means except that it is equipped with a metric-matrices-based mechanism to determine when a cluster is stable (that is, its members will not move from this cluster in subsequent iterations). This mechanism is implemented in the sub-procedure Compute_MM of <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0049946#pone-0049946-g001" target="_blank"><i>Figure</i> 1</a>. We use the theory developed by Zha <i>et al.</i> <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0049946#pone.0049946-Zha1" target="_blank">[20]</a>, based on the singular values of the matrix X of the input data points, to determine when it is appropriate to execute Compute_MM during the k-means iterations. This is implemented in lines 34–40.</p>
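The overall loop can be sketched as a standard Lloyd iteration with a per-cluster stability mask, where stable clusters are frozen. This is a minimal sketch under stated assumptions: Euclidean distance, random initialization from the data points, and a simplified stability test (centroid shift below a tolerance) standing in for the full Compute_MM / singular-value machinery:

```python
import numpy as np

def mm_kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """k-means with a per-cluster stability flag: once a cluster is marked
    stable, its centroid is no longer recomputed. The stability test here
    (centroid shift below tol) is a simplification of the paper's
    metric-matrices check."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    stable = np.zeros(k, dtype=bool)

    for _ in range(max_iter):
        # Assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        new_centroids = centroids.copy()
        for j in range(k):
            if stable[j]:
                continue  # frozen: skip the update for stable clusters
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)

        # Simplified stand-in for Compute_MM: flag clusters whose
        # centroid barely moved as stable.
        shift = np.linalg.norm(new_centroids - centroids, axis=1)
        stable |= shift <= tol
        centroids = new_centroids
        if stable.all():
            break
    return labels, centroids

# Two well-separated blobs should be recovered as two pure clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, cents = mm_kmeans(X, 2)
print(len(set(labels[:20])), len(set(labels[20:])))
```

Freezing stable clusters is what saves work relative to the traditional algorithm: their membership and centroid updates are skipped in later iterations.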
Quality of Clusters (Bozdech <i>et al.</i>, <i>P.f</i> 3D7 Microarray Dataset).
<p>The cluster qualities for the four algorithms are similar. The MSE decreases gradually as the number of clusters increases, except at k = 21, which has a higher MSE than k = 20.</p>