2,787 research outputs found
An Enhanced Initialization Method to Find an Initial Center for K-modes Clustering
Data mining is a technique that extracts information from large amounts of data. Clustering is used to group objects that have similar characteristics. The K-means clustering algorithm is very efficient for large data sets with numerical attributes; however, it does not work well for real-world data sets in which most attributes are categorical. In such cases the K-modes algorithm is used in place of K-means. The existing system considers the initialization of K-modes clustering from the viewpoint of outlier detection, which prevents several initial cluster centers from being drawn from the same cluster. To this end, it uses the Initial_Distance and Initial_Entropy algorithms, which use a new weightage formula to calculate the degree of outlierness of each object, so that K-modes can guarantee that the chosen initial cluster centers are not outliers. To improve performance further, a new modified distance metric, the weighted matching distance, is used to calculate the distance between two objects during initialization. In addition, a data pre-processing method is used to improve the quality of the data. Experiments are carried out on several data sets from the UCI repository, and the results demonstrate the effectiveness of the initialization method in the proposed algorithm.
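As a rough illustration of the K-modes building blocks the abstract refers to, the sketch below uses the plain (unweighted) simple matching distance and an attribute-wise mode. The abstract does not give the weighted matching distance or the Initial_Distance and Initial_Entropy procedures, so none of those are reproduced here; the function names are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching distance: number of attributes on which two
    categorical records differ (unweighted; an assumption, not the
    paper's weighted variant)."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(records):
    """Attribute-wise most frequent value, i.e. the 'mode' of a cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


def assign(records, modes):
    """Index of the nearest mode for each record (ties go to the lowest index)."""
    return [
        min(range(len(modes)), key=lambda k: matching_distance(record, modes[k]))
        for record in records
    ]


if __name__ == "__main__":
    cluster = [("red", "small"), ("red", "large"), ("blue", "large")]
    print(mode_of(cluster))
    print(assign([("red", "large"), ("blue", "small")],
                 [("red", "large"), ("blue", "small")]))
```

A full K-modes run would alternate `assign` and `mode_of` until the assignments stop changing; the choice of initial modes is exactly what the abstract's method addresses.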
Outlier Detection using Boxplot-Mean Algorithm
In this paper, we present a novel method for detecting outliers in intrusion detection systems. The proposed detection algorithm is a hybrid algorithm that combines two techniques, k-means and the boxplot. Experimental results show it to be superior to the existing SCF algorithm. Common problems in existing SCF-based detection techniques include ignoring dependency among categorical variables and difficulty handling data streams and mixed data sets. Moreover, the SCF algorithm and other outlier identification techniques require the number of outliers to be identified in advance, which is impractical. This paper investigates the performance of the boxplot-mean method for detecting different types of abnormal data. Keywords: outlier detection techniques, clustering, SCF, genetic, boxplot-mean technique
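The boxplot half of such a hybrid can be sketched with the standard interquartile-range fences. The abstract does not say how the boxplot is combined with k-means, so this shows only the generic boxplot rule; the function name and the whisker factor of 1.5 are assumptions.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside the boxplot fences
    [Q1 - whisker * IQR, Q3 + whisker * IQR]."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    # Hypothetical measurements, e.g. distances of points to a cluster mean.
    print(boxplot_outliers([10, 12, 11, 13, 12, 11, 95]))
```

Unlike methods that need the number of outliers in advance, this rule derives its threshold from the data themselves.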
A Method Non-Deterministic and Computationally Viable for Detecting Outliers in Large Datasets
This paper presents an outlier detection method that is based on a Variable Precision Rough Set Model (VPRSM). This method generalizes the standard set inclusion relation, which is the foundation of the Rough Sets Basic Model (RSBM). The main contribution of this research is an improvement in the quality of detection because this generalization allows us to classify when there is some degree of uncertainty. From the proposed method, a computationally viable algorithm for large volumes of data is also introduced. The experiments performed in a real scenario and a comparison of the results with the RSBM-based method demonstrate the efficiency of both the method and the algorithm in diverse contexts that involve large volumes of data. This work has been supported by grant TIN2016-78103-C2-2-R, and University of Alicante projects GRE14-02 and Smart University
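The generalized inclusion relation can be illustrated with the usual variable precision definition: a set X is included in Y with tolerance beta when the fraction of X falling outside Y is at most beta, and beta = 0 recovers standard inclusion. The sketch below follows that general definition only; it is not the paper's outlier detection algorithm, and the function names are hypothetical.

```python
def misclassification(x, y):
    """Relative degree of misclassification of set x with respect to set y:
    the fraction of x's elements that are not in y (0.0 for an empty x)."""
    x, y = set(x), set(y)
    if not x:
        return 0.0
    return 1.0 - len(x & y) / len(x)


def included(x, y, beta=0.0):
    """Inclusion with tolerance beta; beta = 0 is ordinary set inclusion."""
    return misclassification(x, y) <= beta


def lower_approximation(classes, target, beta=0.0):
    """Union of the equivalence classes included in target up to beta."""
    result = set()
    for cls in classes:
        if included(cls, target, beta):
            result |= set(cls)
    return result


if __name__ == "__main__":
    classes = [{1, 2, 3, 4, 5}, {6, 7}, {8}]
    target = {1, 2, 3, 4, 6, 7}
    print(lower_approximation(classes, target))             # strict inclusion
    print(lower_approximation(classes, target, beta=0.25))  # tolerant inclusion
```

With beta = 0.25 the first class is accepted even though one of its five elements lies outside the target, which is the kind of classification under uncertainty the abstract describes.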
Data Stream Clustering: Challenges and Issues
Very large databases are required to store massive amounts of data that are
continuously inserted and queried. Analyzing huge data sets and extracting
valuable patterns is of interest to researchers in many applications. We can
identify two main groups of techniques for mining huge databases. One group
treats the data as a stream and applies mining techniques to it, whereas the
second group attempts to solve the problem directly with efficient algorithms.
Recently, many researchers have focused on data streams as an efficient
strategy for mining huge databases instead of mining the entire database. The
main problem in data stream mining is that evolving data is more difficult to
detect with these techniques; therefore, unsupervised methods should be
applied. Clustering techniques, in turn, can lead us to discover hidden
information. In this
survey, we try to clarify: first, the different problem definitions related to
data stream clustering in general; second, the specific difficulties
encountered in this field of research; third, the varying assumptions,
heuristics, and intuitions forming the basis of different approaches; and
fourth, how several prominent solutions tackle different problems. Index Terms:
Data Stream, Clustering, K-Means, Concept drift. Comment: IMECS201
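One common way to cluster a stream in a single pass is sequential (online) k-means, which updates the nearest center with each arriving point instead of revisiting stored data. The sketch below is a generic illustration under that assumption, not a method taken from the survey; the function name and the sample data are hypothetical, and it does nothing to handle concept drift.

```python
def online_kmeans(stream, initial_centers):
    """One-pass k-means: each point moves its nearest center toward it
    by a step of 1 / (number of points that center has absorbed)."""
    centers = [list(c) for c in initial_centers]
    counts = [0] * len(centers)
    for point in stream:
        # Nearest center by squared Euclidean distance.
        k = min(
            range(len(centers)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
        )
        counts[k] += 1
        step = 1.0 / counts[k]  # the first absorbed point replaces the initial guess
        centers[k] = [c + step * (p - c) for p, c in zip(point, centers[k])]
    return centers, counts


if __name__ == "__main__":
    stream = [(0.0, 0.0), (10.0, 10.0), (2.0, 2.0), (12.0, 12.0)]
    centers, counts = online_kmeans(stream, [(1.0, 1.0), (9.0, 9.0)])
    print(centers, counts)
```

Because the step size shrinks as a center absorbs more points, old data dominate over time; a fixed step size is a frequently used alternative when the stream evolves.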
A review of clustering techniques and developments
© 2017 Elsevier B.V. This paper presents a comprehensive study of clustering: existing methods and the developments made at various times. Clustering is defined as an unsupervised learning task in which objects are grouped on the basis of some similarity inherent among them. There are different methods for clustering objects, such as hierarchical, partitional, grid-based, density-based and model-based methods. The approaches used in these methods are discussed together with their respective state of the art and applicability. The measures of similarity and the evaluation criteria, which are the central components of clustering, are also presented in the paper. The applications of clustering in fields such as image segmentation, object and character recognition, and data mining are highlighted.