297 research outputs found

    Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics

    In this dissertation, we study the problem of learning from highly imbalanced data. Imbalanced data learning is both important and challenging in many real applications. Dealing with a minority class normally requires new concepts, observations, and solutions in order to fully understand the underlying complicated models. We systematically review and address this special learning task. We propose a new ensemble learning framework, Diversified Ensemble Classifiers for Imbalanced Data Learning (DECIDL), based on the advantages of existing ensemble imbalanced learning strategies. Our framework combines three learning techniques: a) ensemble learning, b) artificial example generation, and c) diversity construction by reverse data re-labeling. As a meta-learner, DECIDL uses general supervised learning algorithms as base learners to build an ensemble committee. We create a standard benchmark data pool, which contains 30 highly skewed sets with diverse characteristics from different domains, in order to facilitate future research on imbalanced data learning. We use this benchmark pool to evaluate and compare our DECIDL framework with several ensemble learning methods, namely under-bagging, over-bagging, SMOTE-bagging, and AdaBoost. Extensive experiments suggest that our DECIDL framework is comparable with other methods. The data sets, experiments, and results provide a valuable knowledge base for future research on imbalanced learning. We develop a simple but effective artificial example generation method for data balancing. Two new methods, DBEG-ensemble and DECIDL-DBEG, are then designed to improve the power of imbalanced learning. Experiments show that these two methods are comparable to state-of-the-art methods, e.g., GSVM-RU and SMOTE-bagging. Furthermore, we investigate learning on imbalanced data from a new angle: active learning. By combining active learning with the DECIDL framework, we show that the newly designed Active-DECIDL method is very effective for imbalanced learning, suggesting that the DECIDL framework is robust and flexible. Lastly, we apply the proposed learning methods to a real-world bioinformatics problem: protein methylation prediction. Extensive computational results show that the DECIDL method performs very well on this imbalanced data mining task. Importantly, the experimental results confirm our new contributions on this particular data learning problem.
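
Under-bagging, one of the baselines named in this abstract, can be illustrated with a minimal sketch. This is not the DECIDL method itself. It shows one common variant in which each bag keeps all minority examples and draws an equally sized random subset of the majority class. The function names and the toy label vector are assumptions made for illustration only.

```python
import random
from collections import Counter


def under_bagging_samples(labels, n_bags, minority_label, seed=0):
    """Return n_bags lists of indices, each balanced between the two classes."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    bags = []
    for _ in range(n_bags):
        # Keep every minority example and draw an equally sized
        # random subset (without replacement) of the majority class.
        sampled_majority = rng.sample(majority, len(minority))
        bags.append(minority + sampled_majority)
    return bags


def majority_vote(predictions):
    """Combine one prediction per base learner by simple voting."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy data: 5 minority examples (label 1), 95 majority examples (label 0).
labels = [1] * 5 + [0] * 95
bags = under_bagging_samples(labels, n_bags=10, minority_label=1)
```

Each index list in `bags` would be used to train one base learner, and `majority_vote` would combine the learners' predictions for a new example.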

    An advance extended binomial GLMBoost ensemble method with synthetic minority over-sampling technique for handling imbalanced datasets

    Classification is an important activity in a variety of domains. The class imbalance problem reduces the performance of traditional classification approaches. An imbalance problem arises when the class distribution among the instances of a classification dataset is skewed. This study proposes an advanced extended binomial GLMBoost (EBGLMBoost) model coupled with the synthetic minority over-sampling technique (SMOTE) to manage imbalance issues. SMOTE is used to ensure that the target variable's distribution is balanced, whereas the GLMBoost ensemble techniques are built to deal with imbalanced datasets. Twenty different datasets are used throughout the experiments, and the proposed method is compared with the support vector machine (SVM), Nu-SVM, bagging, and AdaBoost classification algorithms. For the training and testing datasets, the model's sensitivity, specificity, geometric mean (G-mean), precision, recall, and F-measure, in percent, are 99.37, 66.95, 80.81, 99.21, 99.37, 99.29 and 98.61, 54.78, 69.88, 98.77, 96.61, 98.68, respectively. A Wilcoxon test indicates that the proposed technique performs well on imbalanced data. Finally, the proposed solutions are capable of efficiently dealing with the problem of class imbalance.
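
The evaluation measures listed in this abstract follow the standard confusion-matrix definitions. The sketch below shows how they are computed for a binary problem. The function name and the counts in the example are hypothetical and are not taken from the paper.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Compute common imbalanced-learning metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    # G-mean is the geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(sensitivity * specificity)
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "g_mean": g_mean,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only.
m = imbalance_metrics(tp=90, fn=10, fp=30, tn=70)
```

Because G-mean multiplies the two per-class rates, it stays low whenever either class is poorly recognized, which is why it is often reported alongside accuracy on skewed data.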

    A Study of Boosting based Transfer Learning for Activity and Gesture Recognition

    Real-world environments are characterized by non-stationary and continuously evolving data. Learning a classification model on such data requires a framework that can adapt to new circumstances. Under such circumstances, transfer learning has become a dependable methodology for improving classification performance with reduced training costs and without the need for explicit relearning from scratch. In this thesis, a novel instance transfer technique that adapts a "Cost-sensitive" variation of AdaBoost is presented. The method capitalizes on the theoretical and functional properties of AdaBoost to selectively reuse outdated training instances obtained from a "source" domain to effectively classify unseen instances occurring in a different but related "target" domain. The algorithm is evaluated on real-world classification problems, namely accelerometer-based 3D gesture recognition, smart home activity recognition, and text categorization. The performance on these datasets is analyzed and evaluated against popular boosting-based instance transfer techniques. In addition, supporting empirical studies that investigate some of the less explored bottlenecks of boosting-based instance transfer methods are presented to assess the suitability and effectiveness of this form of knowledge transfer. Dissertation/Thesis, M.S. Computer Science 201

    A Machine Learning Approach to Diagnosis of Parkinson’s Disease

    I will investigate applications of machine learning algorithms to medical data, adaptations to differences in data collection, and the use of ensemble techniques. Focusing on the binary classification problem of Parkinson’s Disease (PD) diagnosis, I will apply machine learning algorithms to a primary dataset consisting of voice recordings from healthy and PD subjects. Specifically, I will use Artificial Neural Networks, Support Vector Machines, and an Ensemble Learning algorithm to reproduce results from [MS12] and [GM09]. Next, I will adapt a secondary regression dataset of PD recordings and combine it with the primary binary classification dataset, testing various techniques to consolidate the data, including treating the regression data as unlabeled data in a semi-supervised learning approach. I will determine the performance of the above algorithms on this consolidated dataset. Performance will be evaluated using 10-fold cross-validation, and results will be analyzed in a confusion matrix. Accuracy, precision, recall, and F-score will be calculated. This expands on past related work, which has used either a regression dataset alone to predict a Unified Parkinson’s Disease Rating Scale score for PD patients, or a classification dataset to determine a healthy or PD diagnosis. In past work, the datasets have not been combined, and the regression set has not been used to contribute to the evaluation of healthy subjects.
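
The evaluation protocol described here (10-fold cross-validation followed by a confusion matrix) can be sketched as follows. This is a generic illustration, not the author's code. The helper names, the sample count of 100, and the toy label vectors are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, fp, tn) for a binary problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


# Hypothetical sample count, for illustration only.
splits = list(k_fold_indices(100, k=10))
counts = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])
```

In practice, a model would be trained on each `train` split, predictions on the matching `test` split would be pooled, and accuracy, precision, recall, and F-score would be derived from the resulting counts.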

    A Comparative Analysis of Machine Learning Models for Banking News Extraction by Multiclass Classification With Imbalanced Datasets of Financial News: Challenges and Solutions

    Online portals provide an enormous number of news articles every day. Over the years, numerous studies have concluded that news events have a significant impact on forecasting and interpreting the movement of stock prices. The creation of a framework for storing news articles and collecting information for specific domains is an important and untested problem for the Indian stock market. When online news portals produce financial news articles about many subjects simultaneously, finding news articles that are important to a specific domain is nontrivial. Such a system should therefore include one module for extracting and storing news articles and another module for classifying these text documents into one or more specific domains. In the current study, we performed extensive experiments to classify financial news articles into four predefined classes: Banking, Non-Banking, Governmental, and Global. The aim of the multi-class classification was to extract the Banking news and its most correlated news articles from the pool of financial news articles scraped from various web news portals. The news articles divided into these classes were imbalanced. Imbalanced data is a major difficulty for most classifier learning algorithms. However, as recent works suggest, class imbalances are not in themselves a problem, and degradation in performance is often correlated with certain variables relevant to the data distribution, such as the existence of noisy and ambiguous instances near adjacent class boundaries. A variety of solutions for addressing data imbalance have been proposed recently, including over-sampling, down-sampling, and ensemble approaches. We present the various challenges that arise with data imbalance in multiclass classification and solutions for dealing with these challenges. The paper also compares the performance of various machine learning models on imbalanced data and on data balanced using sampling and ensemble techniques. The results show that the Random Forest classifier with data balanced using the over-sampling technique SMOTE performs best in terms of precision, recall, F-1, and accuracy. Among the ensemble classifiers, the Balanced Bagging classifier showed results similar to those of the Random Forest classifier with SMOTE. The Random Forest classifier's accuracy, however, was 100%, whereas the Balanced Bagging classifier's was 99%.
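
The core idea of SMOTE, the over-sampling technique named in this abstract, is to create synthetic minority examples by interpolating between a minority point and one of its nearest minority neighbours. The sketch below is a simplified illustration under that description, not the reference implementation and not the paper's pipeline. The function name, the choice of `k`, and the toy feature vectors are assumptions.

```python
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Find the k nearest minority neighbours of the chosen point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical two-dimensional minority-class feature vectors.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like_samples(minority, n_new=6)
```

For text data such as news articles, the interpolation would operate on numeric feature vectors (for example, term-weight vectors) rather than on the raw documents.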