8 research outputs found

    Improved Instance Selection Methods for Support Vector Machine Speed Optimization

    Support vector machine (SVM) is one of the top picks for pattern recognition and classification tasks. It has been used successfully to classify linearly separable and nonlinearly separable data with high accuracy. However, in terms of classification speed, SVMs are outperformed by many machine learning algorithms, especially when massive datasets are involved. SVM classification time scales linearly with the number of support vectors, and the number of support vectors grows as the dataset grows. Hence, SVM classification can be made substantially faster if the model is trained on a reduced dataset. Instance selection is one of the most effective techniques for minimizing SVM training time. In this study, two instance selection techniques for identifying relevant training instances are proposed. The techniques are evaluated on a dataset containing 4000 emails, and the results obtained are compared with those of other existing techniques. The results reveal an excellent improvement in SVM classification speed.
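
    A minimal sketch of the idea (not the paper's specific methods): a generic nearest-neighbour boundary heuristic, implemented here with scikit-learn as an assumption, reduces the training set, which shrinks the support-vector count and speeds up SVM prediction.

```python
# Sketch: compare SVM trained on full data vs. a boundary-focused subset.
# The selection heuristic below is an illustrative stand-in, not the paper's method.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Keep an instance only if one of its nearest neighbours has a different label,
# i.e. the instance lies near the class boundary and is likely relevant for SVM.
nn = NearestNeighbors(n_neighbors=6).fit(X_tr)
_, idx = nn.kneighbors(X_tr)
near_boundary = np.array([np.any(y_tr[i[1:]] != y_tr[i[0]]) for i in idx])
X_red, y_red = X_tr[near_boundary], y_tr[near_boundary]

for name, (Xs, ys) in {"full": (X_tr, y_tr), "reduced": (X_red, y_red)}.items():
    clf = SVC(kernel="rbf", gamma="scale").fit(Xs, ys)
    t0 = time.perf_counter()
    acc = clf.score(X_te, y_te)
    dt = time.perf_counter() - t0
    print(f"{name}: {len(ys)} instances, {clf.n_support_.sum()} SVs, "
          f"accuracy={acc:.3f}, predict time={dt:.4f}s")
```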

    Hybrid classification approach for imbalanced datasets

    The research area of imbalanced datasets has attracted increasing attention from both academia and industry, because class imbalance poses serious issues for many supervised learning problems. Since the majority class greatly outnumbers the minority class, a classic classifier fitted on the full training dataset tends to assign all data to the majority class, treating the minority data as noise. Thus, it is important to select an appropriate training dataset in the preprocessing stage when classifying imbalanced data. We propose a combined approach of SMOTE (Synthetic Minority Over-sampling Technique) and instance selection. The numerical results show that the proposed combination helps classifiers achieve better performance.
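
    One plausible instantiation of such a combination, sketched with the imbalanced-learn library: SMOTE over-samples the minority class, then Edited Nearest Neighbours acts as the instance selection step. This is an illustration under those assumptions, not necessarily the paper's exact pipeline.

```python
# Sketch: SMOTE over-sampling followed by instance selection (ENN) before classification.
from collections import Counter
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy imbalanced dataset: ~95% majority class, ~5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original class counts:", Counter(y))

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)               # synthesise minority samples
X_sel, y_sel = EditedNearestNeighbours().fit_resample(X_bal, y_bal)   # remove noisy/overlapping instances
print("after SMOTE + instance selection:", Counter(y_sel))

clf = SVC(gamma="scale")
print("CV accuracy on rebalanced data:",
      cross_val_score(clf, X_sel, y_sel, cv=5).mean().round(3))
```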

    Combined forecast model involving wavelet-group methods of data handling for drought forecasting

    Vigorous efforts to improve the effectiveness of drought forecasting models have yet to yield accurate results. This motivates the use of robust forecasting methods that can effectively improve on existing ones. Because of the complex nature of time series data, no single method is suitable in all situations; a combined model that provides better results is therefore proposed. This study introduces wavelet and group method of data handling (GMDH) models by integrating the discrete wavelet transform (DWT) and GMDH with transfer functions such as the sigmoid and radial basis function (RBF), forming three wavelet-GMDH models: modified W-GMDH (MW-GMDH), sigmoid W-GMDH (SW-GMDH) and RBF W-GMDH. To assess the effectiveness of this approach, the models were applied to rainfall data at four study stations, namely Arau and Kuala Krai in Malaysia and Badeggi and Duku-Lade in Nigeria. The data were transformed into four Standardized Precipitation Index (SPI) series, known as SPI3, SPI6, SPI9 and SPI12. The results show that integrating the DWT improved the performance of the conventional GMDH model, and combining the models further improved the performance of each. The proposed model provides efficient, simple and reliable accuracy compared with other models. Incorporating the wavelet transform improved performance at all four stations for the combined W-GMDH (CW-GMDH) and combined regression W-GMDH (CRW-GMDH) models. Duku-Lade station produced the lowest RMSE and MAE values of 0.0239 and 0.0211 and the highest R value of 0.9858; in addition, the CRW-GMDH model produced the lowest RMSE and MAE values of 0.0168 and 0.0117 and the highest R value of 0.9870. In terms of percentage improvement, Duku-Lade station shows reductions in RMSE and MAE of 42.3% and 80.3% respectively, indicating that the model is most suitable for drought forecasting at this station. The comparison across the four stations indicates that the CW-GMDH and CRW-GMDH models are more accurate and perform better than the MW-GMDH, SW-GMDH and RBF W-GMDH models, with the CRW-GMDH model outperforming the CW-GMDH model overall. In conclusion, the CRW-GMDH model performs better than the other models for drought forecasting and provides a promising alternative drought forecasting technique.
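
    A rough sketch of the wavelet-plus-data-driven-model idea, assuming PyWavelets for the DWT and substituting a simple polynomial regressor for GMDH; the SPI series, lag window, wavelet and model below are placeholders, not the study's data or implementation.

```python
# Sketch: DWT features of a lagged SPI window feed a polynomial regressor (GMDH stand-in),
# scored with RMSE and MAE as in the study.
import numpy as np
import pywt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Placeholder SPI-like series; real SPI3/SPI6/... values would come from station rainfall data.
spi = np.sin(np.linspace(0, 20, 400)) + 0.3 * np.random.RandomState(0).randn(400)

def wavelet_features(window, wavelet="db2", level=2):
    """Concatenate the DWT coefficients of a lag window into one feature vector."""
    return np.concatenate(pywt.wavedec(window, wavelet, level=level))

lags = 24
X = np.array([wavelet_features(spi[i - lags:i]) for i in range(lags, len(spi))])
y = spi[lags:]

split = int(0.8 * len(y))
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))  # GMDH stand-in
model.fit(X[:split], y[:split])
pred = model.predict(X[split:])

rmse = np.sqrt(mean_squared_error(y[split:], pred))
mae = mean_absolute_error(y[split:], pred)
print(f"RMSE={rmse:.4f}  MAE={mae:.4f}")
```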

    LACE: Supporting Privacy-Preserving Data Sharing in Transfer Defect Learning

    Cross Project Defect Prediction (CPDP) is a field of study in which an organization lacking enough local data can use data from other organizations or projects to build defect predictors. Research in CPDP has shown challenges in using "other" data, and transfer defect learning has therefore emerged to improve the quality of CPDP results. With this newfound success in CPDP, it is increasingly important to focus on the privacy concerns of data owners.
    To support CPDP, data must be shared, yet many privacy threats inhibit data sharing. We focus on sensitive attribute disclosure threats or attacks, where an attacker seeks to associate records in a data set with their sensitive information. Solutions to this sharing problem come from the field of Privacy Preserving Data Publishing (PPDP), which has emerged as a means to confuse sensitive attribute disclosure attacks and therefore reduce privacy concerns. PPDP covers methods and tools used to disguise raw data for publishing. However, prior work warned that increasing data privacy decreases the efficacy of data mining on the privatized data.
    The goal of this research is to encourage organizations and individuals to share their data publicly and/or with each other, for research purposes and/or for improving the quality of their software products through defect prediction. The contributions of this work give data owners willing to share privatized data three benefits: 1) they are fully aware of the sensitive attribute disclosure risks involved, so they can make an informed decision about what to share; 2) they can privatize their data and have it remain useful; and 3) they can work with others to share their data based on what they learn from each other's data. We call this private multiparty data sharing.
    To achieve these benefits, this dissertation presents LACE (Large-scale Assurance of Confidentiality Environment). LACE incorporates a privacy metric called IPR (Increased Privacy Ratio), which calculates the risk of sensitive attribute disclosure by comparing the results of queries (attacks) on the original data and on a privatized version of that data. LACE also includes a privacy algorithm that uses intelligent instance selection to prune the data to as little as 10% of the original (thus offering complete privacy to the other 90%). It then mutates the remaining data, making over 70% of sensitive attribute disclosure attacks unsuccessful. Finally, LACE can facilitate private multiparty data sharing via a unique leader-follower algorithm developed for this dissertation. The algorithm allows data owners to serially build a privatized data set by contributing only data that are not already in the private cache. In this scenario, each data owner shares even less of their data, some as little as 2%.
    The experiments of this thesis lead to the following conclusion: at least for the defect data studied here, data can be minimized, privatized and shared without a significant degradation in utility. Specifically, in comparative studies with standard privacy models (k-anonymity and data swapping), applied to 10 open-source data sets and 3 proprietary data sets, LACE produces privatized data sets that are significantly smaller than the original data (as low as 2%). As a result, LACE offers better protection against sensitive attribute disclosure attacks than other methods.
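
    An illustrative, approximate reading of an IPR-style metric: run the same sensitive-attribute "queries" against the original and the privatized data and compare how often the attack succeeds. The dissertation's exact formula, data and attack model are not reproduced here; the column names, perturbation and thresholds below are assumptions.

```python
# Sketch: count sensitive-attribute disclosures on original vs. privatized data.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

def attack(data: pd.DataFrame, queries: pd.DataFrame, sensitive: str) -> int:
    """Count queries whose quasi-identifiers match a record that reveals the true sensitive value."""
    hits = 0
    qi_cols = [c for c in queries.columns if c != sensitive]
    for _, q in queries.iterrows():
        match = data
        for c in qi_cols:
            match = match[match[c] == q[c]]
        if len(match) and (match[sensitive] == q[sensitive]).any():
            hits += 1
    return hits

# Toy defect-style data: two binned quasi-identifiers plus a sensitive attribute.
original = pd.DataFrame({
    "loc_bin": rng.randint(0, 5, 200),
    "complexity_bin": rng.randint(0, 4, 200),
    "defective": rng.randint(0, 2, 200),
})

# Crude "privatization": keep a small sample and perturb a quasi-identifier
# (a stand-in for LACE's instance selection plus mutation).
private = original.sample(frac=0.1, random_state=0).copy()
private["loc_bin"] = (private["loc_bin"] + rng.randint(-1, 2, len(private))).clip(0, 4)

queries = original.sample(50, random_state=1)  # attacker's background knowledge
orig_hits = attack(original, queries, "defective")
priv_hits = attack(private, queries, "defective")
ipr_like = 1 - priv_hits / max(orig_hits, 1)
print(f"attacks succeeding: original={orig_hits}, privatized={priv_hits}, "
      f"increased-privacy ratio ~ {ipr_like:.2f}")
```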

    Intelligent instance selection techniques for support vector machine speed optimization with application to e-fraud detection.

    Doctor of Philosophy in Computer Science. University of KwaZulu-Natal, Durban, 2017.
    Decision-making is a very important aspect of many businesses. Wrong decisions carry severe penalties, including financial loss, damage to company reputation and reduced productivity; hence it is vital that managers make the right decisions. Machine Learning (ML) simplifies the decision-making process: it helps to discover useful patterns in historical data, which can then be used for meaningful decision-making. The ability to make strategic and meaningful decisions depends on the reliability of the data. Currently, many organizations are overwhelmed with vast amounts of data, and unfortunately ML algorithms cannot effectively handle large datasets. This thesis therefore proposes seven filter-based and five wrapper-based intelligent instance selection techniques for optimizing the speed and predictive accuracy of ML algorithms, with a particular focus on the Support Vector Machine (SVM). The thesis also proposes a novel fitness function for instance selection. The primary difference between the filter-based and wrapper-based techniques lies in their method of selection: the filter-based techniques use the proposed fitness function, while the wrapper-based techniques use the SVM algorithm itself. The proposed techniques are obtained by fusing the SVM algorithm with the following nature-inspired algorithms: the flower pollination algorithm, social spider algorithm, firefly algorithm, cuckoo search algorithm and bat algorithm. Two of the filter-based techniques are boundary detection algorithms, inspired by edge detection in image processing and edge selection in ant colony optimization. Two sets of experiments were performed to evaluate the proposed techniques (wrapper-based and filter-based). All experiments were performed on four datasets covering three common e-fraud types: credit card fraud, email spam and phishing email. In addition, experiments were performed on 20 datasets from the well-known UCI data repository. The results show that the proposed filter-based techniques improved SVM training speed on 100% (24 out of 24) of the evaluation datasets without significantly affecting SVM classification quality. The wrapper-based techniques consistently improved SVM predictive accuracy on 78% (18 out of 23) of the evaluation datasets and simultaneously improved SVM training speed in all cases. Furthermore, two statistical tests were conducted to validate the results: Friedman's test and Holm's post-hoc test. The statistical tests reveal that the proposed filter-based and wrapper-based techniques are significantly faster than standard SVM and some existing instance selection techniques in all cases, and that the Cuckoo Search Instance Selection Algorithm outperforms all the other proposed techniques in terms of speed. Overall, the proposed techniques have proven to be fast and accurate ML-based e-fraud detection techniques, with improved training speed, predictive accuracy and storage reduction.
    In real-life applications, such as video surveillance and intrusion detection systems, that require a classifier to be trained very quickly for speedy classification of new target concepts, the filter-based techniques provide the best solutions, while the wrapper-based techniques are better suited to applications, such as email filters, that are very sensitive to slight changes in predictive accuracy.
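
    A minimal wrapper-style sketch of the idea: a binary mask over the training instances is evolved by a simple stochastic search, with SVM cross-validation accuracy plus a reduction bonus as the fitness. The search loop and the accuracy/reduction weighting are stand-ins for the thesis's nature-inspired algorithms and fitness function, which are not reproduced here.

```python
# Sketch: wrapper-based instance selection with an SVM-driven fitness function.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(42)

def fitness(mask):
    """Assumed fitness: weighted SVM accuracy on the selected subset plus a reduction bonus."""
    if mask.sum() < 10:                      # need enough instances to train
        return 0.0
    acc = cross_val_score(SVC(gamma="scale"), X[mask], y[mask], cv=3).mean()
    reduction = 1.0 - mask.mean()            # fraction of instances removed
    return 0.9 * acc + 0.1 * reduction       # illustrative weighting only

best = rng.rand(len(y)) < 0.5                # random initial subset
best_fit = fitness(best)
for _ in range(30):                          # crude stochastic search loop
    cand = best.copy()
    flip = rng.rand(len(y)) < 0.05           # flip ~5% of the selection bits
    cand[flip] = ~cand[flip]
    f = fitness(cand)
    if f > best_fit:
        best, best_fit = cand, f

print(f"kept {best.sum()}/{len(y)} instances, fitness={best_fit:.3f}")
```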

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars and PhD students who wish to apply data mining techniques. The primary contribution of this book is to highlight frontier fields and implementations of knowledge discovery and data mining. It may seem that the same things are repeated, but in general the same approaches and techniques can help in different fields and areas of expertise. The book presents knowledge discovery and data mining applications in two sections. As is well known, data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence and other fields, and most of these areas are covered by applications in this book. The eighteen chapters are organized into two parts: Knowledge Discovery and Data Mining Applications.