39,152 research outputs found
Using Unlabeled Data Set for Mining Knowledge from DDB
This paper introduces two algorithms to describe and compare the application of the proposed technique in the two types of distributed database system. The first proposed algorithm, Homogeneous Distributed Clustering for Classification (HOMDC4C), aims to learn a classification model from unlabeled datasets distributed homogeneously over the network. It builds a local clustering model on the datasets distributed over three sites in the network and then builds a local classification model from the labeled data produced by the clustering model. On one computer, designated the control computer, a global classification model is built and then used for future prediction. The second proposed algorithm, Heterogeneous Distributed Clustering for Classification (HETDC4C), aims to build a classification model over unlabeled datasets distributed heterogeneously over the sites of the network; the datasets are collected on one central computer, where the clustering model and then the classification model are built. The objective of this work is to use unlabeled data to produce a set of labeled data useful for building a classification model that can predict the label of any unlabeled instance. This is done using the Clustering for Classification technique, which is then applied in a distributed database environment to reduce the execution time and storage space that is required
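The clustering-for-classification idea in the abstract above can be illustrated with a minimal single-site sketch. It is not the authors' HOMDC4C or HETDC4C implementation: the distribution across sites is omitted, and the choice of k-means for clustering, k-nearest-neighbours for classification, the function names, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest already chosen seed.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
              for p in points]
    return labels, centroids

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical unlabeled data: two well-separated groups.
unlabeled = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
             (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]

# Step 1: clustering turns unlabeled data into pseudo-labeled data.
pseudo_labels, _ = kmeans(unlabeled, k=2)

# Step 2: a classifier trained on the pseudo-labels predicts new instances.
prediction = knn_predict(unlabeled, pseudo_labels, (4.9, 5.2))
```

In a homogeneous distributed setting as described above, step 1 and a local version of step 2 would presumably run at each site before a global model is assembled; how the local models are combined is not specified in the abstract, so it is left out here.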
NetCluster: a Clustering-Based Framework for Internet Tomography
Abstract: In this paper, Internet data collected via passive measurement are analyzed to obtain localization information on nodes by clustering (i.e., grouping together) nodes that exhibit similar network path properties. Since traditional clustering algorithms fail to correctly identify clusters of homogeneous nodes, we propose a novel framework, named "NetCluster", suited to analyze Internet measurement datasets. We show that the proposed framework correctly analyzes synthetically generated traces. Finally, we apply it to real traces collected at the access link of our campus LAN and discuss the network characteristics as seen at the vantage point. I. INTRODUCTION AND MOTIVATIONS The Internet is a complex distributed system which continues to grow and evolve. The unregulated and heterogeneous structure of the current Internet makes it challenging to obtai
Anytime Hierarchical Clustering
We propose a new anytime hierarchical clustering method that iteratively
transforms an arbitrary initial hierarchy on the configuration of measurements
along a sequence of trees we prove for a fixed data set must terminate in a
chain of nested partitions that satisfies a natural homogeneity requirement.
Each recursive step re-edits the tree so as to improve a local measure of
cluster homogeneity that is compatible with a number of commonly used (e.g.,
single, average, complete) linkage functions. As an alternative to the standard
batch algorithms, we present numerical evidence to suggest that appropriate
adaptations of this method can yield decentralized, scalable algorithms
suitable for distributed/parallel computation of clustering hierarchies and
online tracking of clustering trees applicable to large, dynamically changing
databases and anomaly detection. Comment: 13 pages, 6 figures, 5 tables, in preparation for submission to a
conferenc
Merging K-means with hierarchical clustering for identifying general-shaped groups
Clustering partitions a dataset such that observations placed together in a
group are similar but different from those in other groups. Hierarchical and
K-means clustering are two approaches but have different strengths and
weaknesses. For instance, hierarchical clustering identifies groups in a
tree-like structure but suffers from computational complexity in large datasets
while K-means clustering is efficient but designed to identify homogeneous
spherically-shaped clusters. We present a hybrid non-parametric clustering
approach that amalgamates the two methods to identify general-shaped clusters
and that can be applied to larger datasets. Specifically, we first partition
the dataset into spherical groups using K-means. We next merge these groups
using hierarchical methods with a data-driven distance measure as a stopping
criterion. Our proposal has the potential to reveal groups with general shapes
and structure in a dataset. We demonstrate good performance on several
simulated and real datasets. Comment: 16 pages, 1 table, 9 figures; accepted for publication in Sta
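The two-stage idea in the abstract above (partition with K-means first, then merge the resulting groups hierarchically) can be sketched as follows. This is only an illustration under stated assumptions: the paper's data-driven stopping criterion is replaced here by a fixed, hypothetical `threshold`, single linkage is an arbitrary choice of merge rule, and the function names and sample data are invented for the example.

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return [min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def merge_groups(points, labels, threshold):
    """Single-linkage merging of k-means groups until the closest pair of
    groups is farther apart than `threshold` (a hypothetical fixed cutoff)."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    clusters = list(groups.values())

    def gap(a, b):
        # Single linkage: distance between the closest pair of points.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        if gap(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: two elongated (non-spherical) groups, one at y=0, one at y=10.
points = [(float(x), 0.0) for x in range(10)] + \
         [(float(x), 10.0) for x in range(10)]

labels = kmeans(points, k=4)            # stage 1: over-partition into compact groups
merged = merge_groups(points, labels, threshold=1.5)  # stage 2: merge neighbours
```

On this toy data K-means with four groups splits each elongated line in half, and the merging stage joins the halves back into two line-shaped clusters, which a single run of K-means with spherical groups would not be designed to recover in general.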
Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters
There are two notoriously hard problems in cluster analysis: estimating the
number of clusters and checking whether the population to be clustered is not
actually homogeneous. Given a dataset, a clustering method and a cluster
validation index, this paper proposes to set up null models that capture
structural features of the data that cannot be interpreted as indicating
clustering. Artificial datasets are sampled from the null model with parameters
estimated from the original dataset. This can be used for testing the null
hypothesis of a homogeneous population against a clustering alternative. It can
also be used to calibrate the validation index for estimating the number of
clusters, by taking into account the expected distribution of the index under
the null model for any given number of clusters. The approach is illustrated by
three examples, involving various different clustering techniques (partitioning
around medoids, hierarchical methods, a Gaussian mixture model), validation
indexes (average silhouette width, prediction strength and BIC), and issues
such as mixed type data, temporal and spatial autocorrelation
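The testing procedure outlined in the abstract above (fit a null model to the data, sample artificial datasets from it, and compare a validation index against its null distribution) can be sketched in a few lines. The specifics here are assumptions, not the paper's choices: the null model is a single 1-D Gaussian, the index is the ratio of the best two-cluster within-group sum of squares to the total sum of squares (standing in for indexes such as average silhouette width), and all names and data are hypothetical.

```python
import random
import statistics

def two_split_ratio(values):
    """Best 2-cluster within-group SS divided by total SS for 1-D data.
    Small values indicate strong two-cluster structure."""
    xs = sorted(values)

    def ss(segment):
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)

    total = ss(xs)
    # In one dimension the optimal 2-means split is a cut of the sorted data.
    best = min(ss(xs[:i]) + ss(xs[i:]) for i in range(1, len(xs)))
    return best / total

def bootstrap_p_value(values, n_boot=200, seed=0):
    """Parametric bootstrap test of homogeneity against clustering."""
    rng = random.Random(seed)
    # Null model (assumed): one Gaussian with parameters estimated from the data.
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    observed = two_split_ratio(values)
    null_stats = [
        two_split_ratio([rng.gauss(mu, sigma) for _ in values])
        for _ in range(n_boot)
    ]
    # Count null datasets that look at least as clustered as the observed one.
    as_extreme = sum(s <= observed for s in null_stats)
    return (1 + as_extreme) / (n_boot + 1)

# Hypothetical data with two clearly separated groups.
data = [i * 0.1 for i in range(20)] + [10 + i * 0.1 for i in range(20)]
p_value = bootstrap_p_value(data)
```

A small p-value means the observed index is more extreme than what the homogeneous null model tends to produce. The same null distribution could in principle be computed for each candidate number of clusters to calibrate the index, as the abstract describes, but that extension is not shown here.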