10,379 research outputs found
Certainty of outlier and boundary points processing in data mining
Data certainty is one of the issues in the real-world applications which is
caused by unwanted noise in data. Recently, more attentions have been paid to
overcome this problem. We proposed a new method based on neutrosophic set (NS)
theory to detect boundary and outlier points as challenging points in
clustering methods. Generally, firstly, a certainty value is assigned to data
points based on the proposed definition in NS. Then, certainty set is presented
for the proposed cost function in NS domain by considering a set of main
clusters and noise cluster. After that, the proposed cost function is minimized
by gradient descent method. Data points are clustered based on their membership
degrees. Outlier points are assigned to noise cluster and boundary points are
assigned to main clusters with almost same membership degrees. To show the
effectiveness of the proposed method, two types of datasets including 3
datasets in Scatter type and 4 datasets in UCI type are used. Results
demonstrate that the proposed cost function handles boundary and outlier points
with more accurate membership degrees and outperforms existing state of the art
clustering methods.Comment: Conference Paper, 6 page
Recommended from our members
The role of human factors in stereotyping behavior and perception of digital library users: A robust clustering approach
To deliver effective personalization for digital library users, it is necessary to identify which human factors are most relevant in determining the behavior and perception of these users. This paper examines three key human factors: cognitive styles, levels of expertise and gender differences, and utilizes three individual clustering techniques: k-means, hierarchical clustering and fuzzy clustering to understand user behavior and perception. Moreover, robust clustering, capable of correcting the bias of individual clustering techniques, is used to obtain a deeper understanding. The robust clustering approach produced results that highlighted the relevance of cognitive style for user behavior, i.e., cognitive style dominates and justifies each of the robust clusters created. We also found that perception was mainly determined by the level of expertise of a user. We conclude that robust clustering is an effective technique to analyze user behavior and perception
Clustering Time Series from Mixture Polynomial Models with Discretised Data
Clustering time series is an active research area with applications in many fields. One common feature of time series is the likely presence of outliers. These uncharacteristic data can significantly effect the quality of clusters formed. This paper evaluates a method of over-coming the detrimental effects of outliers. We describe some of the alternative approaches to clustering time series, then specify a particular class of model for experimentation with k-means clustering and a correlation based distance metric. For data derived from this class of model we demonstrate that discretising the data into a binary series of above and below the median improves the clustering when the data has outliers. More specifically, we show that firstly discretisation does not significantly effect the accuracy of the clusters when there are no outliers and secondly it significantly increases the accuracy in the presence of outliers, even when the probability of outlier is very low
Tracking time evolving data streams for short-term traffic forecasting
YesData streams have arisen as a relevant topic during the last few years as an efficient method for extracting knowledge from big data. In the robust layered ensemble model (RLEM) proposed in this paper for short-term traffic flow forecasting, incoming traffic flow data of all connected road links are organized in chunks corresponding to an optimal time lag. The RLEM model is composed of two layers. In the first layer, we cluster the chunks by using the Graded Possibilistic c-Means method. The second layer is made up by an ensemble of forecasters, each of them trained for short-term traffic flow forecasting on the chunks belonging to a specific cluster. In the operational phase, as a new chunk of traffic flow data presented as input to the RLEM, its memberships to all clusters are evaluated, and if it is not recognized as an outlier, the outputs of all forecasters are combined in an ensemble, obtaining in this a way a forecasting of traffic flow for a short-term time horizon. The proposed RLEM model is evaluated on a synthetic data set, on a traffic flow data simulator and on two real-world traffic flow data sets. The model gives an accurate forecasting of the traffic flow rates with outlier detection and shows a good adaptation to non-stationary traffic regimes. Given its characteristics of outlier detection, accuracy, and robustness, RLEM can be fruitfully integrated in traffic flow management systems
Data mining as a tool for environmental scientists
Over recent years a huge library of data mining algorithms has been developed to tackle a variety of problems in fields such as medical imaging and network traffic analysis. Many of these techniques are far more flexible than more classical modelling approaches and could be usefully applied to data-rich environmental problems. Certain techniques such as Artificial Neural Networks, Clustering, Case-Based Reasoning and more recently Bayesian Decision Networks have found application in environmental modelling while other methods, for example classification and association rule extraction, have not yet been taken up on any wide scale. We propose that these and other data mining techniques could be usefully applied to difficult problems in the field. This paper introduces several data mining concepts and briefly discusses their application to environmental modelling, where data may be sparse, incomplete, or heterogenous
- …