
    Semi-random partitioning of data into training and test sets in granular computing context

    Due to the vast and rapid increase in the size of data, machine learning has become an increasingly popular approach to knowledge discovery and predictive modelling. For both purposes, it is essential to partition the data set into a training set and a test set: the training set is used to learn a model, and the test set is then used to evaluate the performance of the model learned from the training set. However, the split of the data into the two sets, and its influence on model performance, has been investigated only with respect to the optimal proportion for the two sets, with no attention paid to the characteristics of the data within the training and test sets. Thus, the current practice is to randomly split the data into approximately 70% for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) sample representativeness issues. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class; the representativeness of the training set affects a model's performance through the lack of opportunity for the algorithm to learn, by not presenting it with relevant examples, similar to testing a student on material that was not taught. To address these two issues, we propose a semi-random data partitioning framework in the setting of granular computing. While we discuss how the framework can address both issues, in this paper we focus on avoiding class imbalance when partitioning the data through the proposed approach. The results show that avoiding class imbalance results in better model performance.
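    As a minimal illustration of the class-imbalance aspect only, the sketch below performs a class-proportional ("semi-random") split: instances are shuffled randomly within each class, and each class is then divided at the chosen ratio, so the training and test sets both preserve the original class distribution. The function name semi_random_split and the 70/30 default are illustrative assumptions; this is not the authors' implementation of the full granular-computing framework.

    # A minimal sketch of class-proportional ("semi-random") partitioning:
    # random within each class, deterministic in the class proportions.
    import numpy as np

    def semi_random_split(X, y, train_ratio=0.7, seed=0):
        rng = np.random.default_rng(seed)
        train_idx, test_idx = [], []
        for label in np.unique(y):
            idx = np.flatnonzero(y == label)
            rng.shuffle(idx)                      # random within the class
            cut = int(round(train_ratio * len(idx)))
            train_idx.extend(idx[:cut])           # ~70% of each class
            test_idx.extend(idx[cut:])            # ~30% of each class
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

    # Example: a 90/10 imbalanced data set keeps its ratio in both splits.
    X = np.arange(200).reshape(100, 2)
    y = np.array([0] * 90 + [1] * 10)
    X_tr, X_te, y_tr, y_te = semi_random_split(X, y)
    print(np.bincount(y_tr), np.bincount(y_te))   # [63 7] and [27 3]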

    Nature inspired framework of ensemble learning for collaborative classification in granular computing context

    Due to the vast and rapid increase in the size of data, machine learning has become an increasingly popular approach to data classification, which can be done by training a single classifier or a group of classifiers. A single classifier is typically learned by using a standard algorithm, such as C4.5. Since each standard learning algorithm has its own advantages and disadvantages, ensemble learning, such as Bagging, has been increasingly used to learn a group of classifiers for collaborative classification, thus compensating for the disadvantages of individual classifiers. In particular, a group of base classifiers is learned in the training stage, and then some or all of the base classifiers are employed to classify unseen instances in the testing stage. In this paper, we address two critical points that can impact the classification accuracy, in order to overcome the limitations of the Bagging approach. Firstly, it is important to judge effectively which base classifiers qualify to be employed for classifying test instances. Secondly, the final classification needs to be done by combining the outputs of the base classifiers, i.e. by voting, which means that the voting strategy can greatly impact whether a test instance is classified correctly. In order to address the above points, we propose a nature-inspired approach of ensemble learning to improve the overall accuracy in the setting of granular computing. The proposed approach is validated through experimental studies using real-life data sets. The results show that the proposed approach effectively overcomes the limitations of the Bagging approach.
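    The sketch below illustrates, under stated assumptions, the two refinements to Bagging that the abstract targets: (a) employing only base classifiers that qualify, here via a simple accuracy threshold on each classifier's out-of-bag instances, and (b) accuracy-weighted rather than simple majority voting. The qualification rule and the weights are illustrative stand-ins for the paper's nature-inspired strategy, which is not reproduced here.

    # A minimal Bagging sketch with (a) classifier qualification and
    # (b) weighted voting; decision trees stand in for C4.5.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)

    rng = np.random.default_rng(0)
    ensemble = []
    for _ in range(25):
        # Bootstrap sample: draw with replacement, as in standard Bagging.
        idx = rng.integers(0, len(X_train), len(X_train))
        clf = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        # (a) Qualify each classifier on the instances its bootstrap left out.
        oob = np.setdiff1d(np.arange(len(X_train)), idx)
        acc = clf.score(X_train[oob], y_train[oob]) if len(oob) else 0.0
        if acc >= 0.8:                    # illustrative qualification rule
            ensemble.append((clf, acc))

    # (b) Accuracy-weighted voting over the qualified classifiers.
    votes = np.zeros((len(X_test), len(np.unique(y))))
    for clf, weight in ensemble:
        votes[np.arange(len(X_test)), clf.predict(X_test)] += weight
    print("ensemble accuracy:", np.mean(votes.argmax(axis=1) == y_test))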