2 research outputs found

    Positive region: An enhancement of partitioning attribute based rough set for categorical data

    Datasets containing multi-valued attributes arise in several domains, such as pattern recognition, machine learning and data mining, and such data often need to be partitioned. Attribute partitioning is the clustering step applied to the whole data set before further processing. Prominent rough set-based approaches already exist for grouping objects and handling uncertain data; they use attribute indiscernibility and mean roughness measures to perform attribute partitioning. Nevertheless, most methods for selecting a partitioning attribute for categorical data clustering are incapable of optimal partitioning. The indiscernibility and mean roughness measures require calculating the lower approximation, which has lower accuracy and is expensive to compute. This reduces the growth of the set of attributes and neglects the data found within the boundary region. This paper presents a new concept called "Positive Region Based Mean Dependency (PRD)", which calculates attribute dependency. PRD defines a method for determining the mean dependency of the attributes, suitable for categorical datasets, using a positive region-based mean dependency measure. By avoiding the lower approximation, PRD is an optimal substitute for the conventional dependency measure in partitioning attribute selection. In contrast to traditional RST partitioning methods, the proposed method can be employed as a measure of data output uncertainty and as a tailback for larger and multiple data clustering. The performance of the proposed method is evaluated and compared with the Information-Theoretical Dependence Roughness (ITDR) and Maximum Indiscernible Attribute (MIA) algorithms.
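    The abstract does not give the PRD formula, so the following is only a minimal sketch of the general idea it builds on: the classical rough set dependency degree, computed from a positive region, averaged per attribute. The function names, the toy data, and the choice to average over how well one attribute determines each of the others are all assumptions made for illustration, not the authors' definition.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group object indices into indiscernibility classes with respect to attrs."""
    classes = defaultdict(set)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].add(i)
    return list(classes.values())

def dependency(rows, cond, dec):
    """Classical dependency degree |POS_cond(dec)| / |U| for single attributes.

    The positive region is the union of the cond-classes that fit entirely
    inside one dec-class.
    """
    dec_classes = partition(rows, [dec])
    positive_region = set()
    for block in partition(rows, [cond]):
        if any(block <= d for d in dec_classes):
            positive_region |= block
    return len(positive_region) / len(rows)

def mean_dependency(rows, attr, all_attrs):
    """Assumed aggregate: average degree to which attr determines each other attribute."""
    others = [a for a in all_attrs if a != attr]
    return sum(dependency(rows, attr, other) for other in others) / len(others)

# Hypothetical toy information system with three categorical attributes.
rows = [
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "blue", "size": "L", "shape": "round"},
    {"color": "blue", "size": "S", "shape": "square"},
]
attrs = ["color", "size", "shape"]

scores = {a: mean_dependency(rows, a, attrs) for a in attrs}
best = max(scores, key=scores.get)
print(scores, best)
```

    On this toy table the red objects agree on every attribute, so "color" has the largest mean dependency and would be picked as the partitioning attribute under these assumptions.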

    New rough set based maximum partitioning attribute algorithm for categorical data clustering

    Clustering a set of data into homogeneous groups is a fundamental operation in data mining. Recently, attention has turned to categorical data clustering, where the data set consists of non-numerical attributes. However, implementing several existing categorical clustering algorithms is challenging, as some cannot handle uncertainty while others have stability issues. Rough Set Theory (RST) is a mathematical tool for dealing with categorical data and handling uncertainty. It is also used to identify cause-effect relationships in databases as a form of learning and data mining. Therefore, this study aims to address the issues of uncertainty and stability in categorical clustering by proposing an improved algorithm centred on RST. The proposed method employs a partitioning measure to calculate the positive and boundary regions of attributes in the information system. First, an attribute partitioning method called Positive Region-based Indiscernibility (PRI) was developed to address the uncertainty issue in attribute partitioning for categorical data. The PRI method requires a partitioning calculation based on the positive and boundary regions. Next, to address the computational complexity of the clustering process, a clustering attribute selection method called Maximum Mean Partitioning (MMP) is introduced, which computes the mean partitioning value of each attribute; the attribute with the maximum mean partitioning value is chosen as the best clustering attribute. Integrating the proposed PRI and MMP methods yields a new rough set hybrid algorithm for categorical data clustering, named the Maximum Partitioning Attribute (MPA) algorithm. This hybrid algorithm is intended as an all-inclusive solution addressing uncertainty, computational complexity, cluster purity, and accuracy in attribute partitioning and clustering attribute selection.
The proposed MPA algorithm is compared against the baseline algorithms, namely Maximum Significance Attribute (MSA), Information-Theoretic Dependency Roughness (ITDR), Maximum Indiscernibility Attribute (MIA), and simple classical K-means. In addition, seven small data sets from previous research cases and 21 UCI repository and benchmark datasets are used for validation. The results, presented in tabular and graphical form, show that the proposed MPA algorithm outperforms the baseline algorithms on all data sets. The MPA algorithm improves rough accuracy over MSA, ITDR, and MIA by 54.42%. It also reduces computational complexity compared to MSA, ITDR, and MIA, with 77.11% less time and 58.66% fewer iterations. Similarly, an improvement of up to 97.35% in overall purity was observed for the MPA algorithm against MSA, ITDR, and MIA. In addition, MPA increased the overall accuracy of simple K-means by up to 34.41%. Hence, the results indicate that the proposed MPA algorithm offers a promising solution to the categorical data clustering problem.
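The abstract reports overall purity without defining it. As a hedged illustration, the sketch below uses the commonly used definition (the fraction of objects that belong to the majority class of their cluster); whether the paper uses exactly this formula is an assumption, and the function name and sample labels are hypothetical.

```python
from collections import Counter

def overall_purity(cluster_ids, class_labels):
    """Fraction of objects that carry the majority class label of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    counts_by_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(class_labels)

# Hypothetical clustering of six objects into two clusters.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "a"]
purity = overall_purity(clusters, labels)
print(purity)
```

Here each cluster holds two objects of its majority class and one outlier, so four of the six objects count toward purity.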