11 research outputs found

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data are invariably problematic: noisy, with missing entries and imbalanced classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often perform poorly, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
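A cost-sensitive SVM penalizes errors on the minority class more heavily than errors on the majority class. The abstract does not say how those costs are chosen, so the following is only a minimal sketch of one common heuristic: weights inversely proportional to class frequency. The function name `balanced_class_weights` is hypothetical and is not taken from the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a per-class misclassification weight (hypothetical helper).

    Assumption: weight = n_samples / (n_classes * class_count), so the
    rarer a class is, the larger its weight. This is a common heuristic,
    not necessarily the weighting used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Usage: 90 majority examples (label 0) and 10 minority examples (label 1).
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

In a weighted SVM these values would scale the per-class penalty term, so a misclassified minority example costs more than a misclassified majority example.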

    Learning from Imbalanced Multi-label Data Sets by Using Ensemble Strategies

    Multi-label classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Problems of this type are ubiquitous in everyday life. For example, a movie can be categorized as action, crime, and thriller. Most multi-label classification algorithms are designed for balanced data and do not work well on imbalanced data. In real applications, however, most datasets are imbalanced. Therefore, we focus on improving multi-label classification performance on imbalanced datasets. In this paper, a state-of-the-art multi-label classification algorithm called IBLR_ML is employed. This algorithm is a combination of the k-nearest neighbor and logistic regression algorithms. The logistic regression part of this algorithm is combined with two ensemble learning algorithms, Bagging and Boosting. The proposed approach is called IB-ELR. In this paper, for the first time, the ensemble bagging method with a stable learner as the base learner and imbalanced data sets as the training data is examined. Finally, the proposed methods are implemented in Java for evaluation. Experimental results show the effectiveness of the proposed methods. Keywords: Multi-label classification, Imbalanced data set, Ensemble learning, Stable algorithm, Logistic regression, Bagging, Boosting

    SMOTE: Synthetic Minority Over-sampling Technique

    An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy
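The abstract above says only that synthetic minority class examples are created. A minimal sketch of the usual formulation follows: each synthetic point is placed at a random position on the line segment between a minority example and one of its k nearest minority neighbours. The function name `smote`, its parameters, and the use of Euclidean distance are assumptions made for illustration, and the under-sampling of the majority class is omitted.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by neighbour interpolation (sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority example as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority examples by Euclidean distance to it.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours and interpolate towards it.
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # uniform in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Usage: four minority points on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for point in smote(minority, n_synthetic=3, k=2, seed=1):
    print(point)
```

Because every synthetic point lies between two existing minority examples, the new points stay inside the region the minority class already occupies instead of duplicating existing examples.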

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired technologies such as swarm intelligence as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book discusses the use of CI to solve a variety of applications, demonstrating its wide reach and relevance. Combining optimization methods with data mining strategies makes a strong and reliable prediction tool for handling real-life applications.

    Effective and Efficient Optimization Methods for Kernel Based Classification Problems

    Kernel methods are a popular choice in solving a number of problems in statistical machine learning. In this thesis, we propose new methods for two important kernel based classification problems: 1) learning from highly unbalanced large-scale datasets and 2) selecting a relevant subset of input features for a given kernel specification. The first problem is known as the rare class problem, which is characterized by a highly skewed or unbalanced class distribution. Unbalanced datasets can introduce significant bias in standard classification methods. In addition, due to the increase of data in recent years, large datasets with millions of observations have become commonplace. We propose an approach to address both the problem of bias and computational complexity in rare class problems by optimizing area under the receiver operating characteristic curve and by using a rare class only kernel representation, respectively. We justify the proposed approach theoretically and computationally. Theoretically, we establish an upper bound on the difference between selecting a hypothesis from a reproducing kernel Hilbert space and a hypothesis space which can be represented using a subset of kernel functions. This bound shows that for a fixed number of kernel functions, it is optimal to first include functions corresponding to rare class samples. We also discuss the connection of a subset kernel representation with the Nystrom method for a general class of regularized loss minimization methods. Computationally, we illustrate that the rare class representation produces statistically equivalent test error results on highly unbalanced datasets compared to using the full kernel representation, but with significantly better time and space complexity. Finally, we extend the method to rare class ordinal ranking, and apply it to a recent public competition problem in health informatics. 
    The second problem studied in the thesis is known in the literature as the feature selection problem. Embedding feature selection in kernel classification leads to a non-convex optimization problem. We specify a primal formulation and solve the problem using a second-order trust region algorithm. To improve efficiency, we use the two-block Gauss-Seidel method, breaking the problem into a convex support vector machine subproblem and a non-convex feature selection subproblem. We reduce the possibility of saddle point convergence and improve solution quality by sharing an explicit functional margin variable between block iterates. We illustrate how our algorithm improves upon state-of-the-art methods.
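The first problem in the thesis abstract above targets the area under the ROC curve rather than plain accuracy. As a minimal sketch of that metric only (not of the thesis's optimization method), the AUC can be computed as the fraction of (rare, common) example pairs in which the rare-class example receives the higher score, with ties counted as one half. The function name `auc` and the 0/1 label encoding are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking (sketch).

    Assumption: labels are 1 for the rare (positive) class and 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Usage: three of the four (positive, negative) pairs are ranked correctly.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

This pairwise form costs time proportional to the number of positive-negative pairs, and that count stays manageable when the positive class is rare.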