    Subspace Support Vector Data Description and Extensions

    Machine learning deals with discovering the knowledge that governs the learning process. The science of machine learning helps create techniques that enhance the capabilities of a system through the use of data. Typical machine learning techniques identify or predict different patterns in the data. In classification tasks, a machine learning model is trained using some training data to identify the unknown function that maps the input data to the output labels. The classification task gets challenging if the data from some categories are either unavailable or so diverse that they cannot be modelled statistically. For example, to train a model for anomaly detection, it is usually challenging to collect anomalous data for training, but the normal data is available in abundance. In such cases, it is possible to use One-Class Classification (OCC) techniques where the model is trained by using data only from one class. OCC algorithms are practical in situations where it is vital to identify one of the categories, but the examples from that specific category are scarce. Numerous OCC techniques have been proposed in the literature that model the data in the given feature space; however, such data can be high-dimensional or may not provide discriminative information for classification. In order to avoid the curse of dimensionality, standard dimensionality reduction techniques are commonly used as a preprocessing step in many machine learning algorithms. Principal Component Analysis (PCA) is an example of a widely used algorithm to transform data into a subspace suitable for the task at hand while maintaining the meaningful features of a given dataset. This thesis provides a new paradigm that jointly optimizes a subspace and data description for one-class classification via Support Vector Data Description (SVDD). 
We initiated the idea of subspace learning for one-class classification by proposing a novel Subspace Support Vector Data Description (SSVDD) method, which was further extended to Ellipsoidal Subspace Support Vector Data Description (ESSVDD). ESSVDD generalizes the hyperspherical data description of SSVDD to an ellipsoidal one, and it converges faster than SSVDD. When data is collected from multiple sources, it is important to train a joint model for the multimodal data. Therefore, we also proposed a multimodal approach, namely Multimodal Subspace Support Vector Data Description (MSSVDD), which transforms the data from multiple modalities to a common shared space for OCC. An important contribution of this thesis is a framework that unifies the subspace learning methods for SVDD. The proposed Graph-Embedded Subspace Support Vector Data Description (GESSVDD) framework helps reveal novel insights into the previously proposed methods and allows novel variants that incorporate different optimization goals to be derived. The main focus of the thesis is on generic novel methods that can be adapted to different application domains. We experimented with standard datasets from different domains, such as robotics, healthcare, and economics, and achieved better performance than competing methods in most cases. We also proposed a taxa identification framework for rare benthic macroinvertebrates. Benthic macroinvertebrate taxa distribution is typically very imbalanced. The number of training images for the rarest classes is too low to train deep learning-based methods properly, even though these rarest classes can be central to biodiversity monitoring. We show that classic one-class classifiers in general, and the proposed methods in particular, can enhance the classification performance of a deep neural network on imbalanced datasets.
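For orientation, the standard SVDD problem that the abstract above builds on can be written as follows. This is the textbook soft-margin form, not a formula quoted from the thesis; the symbols $R$ (radius), $\mathbf{a}$ (center), $\xi_i$ (slack variables), $C$ (trade-off parameter), and the projection matrix $Q$ are notation assumed here for illustration.

```latex
\min_{R,\,\mathbf{a},\,\boldsymbol{\xi}} \; R^2 + C \sum_{i=1}^{N} \xi_i
\quad \text{s.t.} \quad
\|\mathbf{x}_i - \mathbf{a}\|_2^2 \le R^2 + \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, N.

% Subspace variant (sketch): the same constraint applied to projected data,
% with Q \in \mathbb{R}^{d \times D}, d \le D, learned jointly with R and \mathbf{a}.
\|Q\mathbf{x}_i - \mathbf{a}\|_2^2 \le R^2 + \xi_i
```

Under this reading, a test point $\mathbf{x}$ would be accepted as belonging to the target class when $\|Q\mathbf{x} - \mathbf{a}\|_2^2 \le R^2$. The exact objective used in the thesis, including any regularization on $Q$, may differ.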

    Parallel selective sampling method for imbalanced and large data classification

    We proposed a new algorithm to preprocess huge and imbalanced data. This algorithm, based on distance calculations, reduces both size and imbalance. The selective sampling method was conceived for parallel and distributed computing. It was combined with SVM, obtaining optimized classification performance. Synthetic and real data sets were used to evaluate the classifiers' performance. Several applications aim to identify rare events from very large data sets. Classification algorithms may face severe limitations on large data sets and show performance degradation due to class imbalance. Many solutions have been presented in the literature that deal with either very large data sets or class imbalance, but separately. In this paper we assessed the performance of a novel method, Parallel Selective Sampling (PSS), which selects data from the majority class to reduce imbalance in large data sets. PSS was combined with Support Vector Machine (SVM) classification. PSS-SVM showed excellent performance on synthetic data sets, much better than SVM. Moreover, we showed that on real data sets PSS-SVM classifiers performed slightly better than SVM and RUSBoost classifiers, with reduced processing times. In fact, the proposed strategy was conceived and designed for parallel and distributed computing. In conclusion, PSS-SVM is a valuable alternative to SVM and RUSBoost for the classification of huge and imbalanced data, due to its accurate statistical predictions and low computational complexity.

    Effective and Efficient Optimization Methods for Kernel Based Classification Problems

    Kernel methods are a popular choice in solving a number of problems in statistical machine learning. In this thesis, we propose new methods for two important kernel based classification problems: 1) learning from highly unbalanced large-scale datasets and 2) selecting a relevant subset of input features for a given kernel specification. The first problem is known as the rare class problem, which is characterized by a highly skewed or unbalanced class distribution. Unbalanced datasets can introduce significant bias in standard classification methods. In addition, due to the increase of data in recent years, large datasets with millions of observations have become commonplace. We propose an approach to address both the problem of bias and computational complexity in rare class problems by optimizing area under the receiver operating characteristic curve and by using a rare class only kernel representation, respectively. We justify the proposed approach theoretically and computationally. Theoretically, we establish an upper bound on the difference between selecting a hypothesis from a reproducing kernel Hilbert space and a hypothesis space which can be represented using a subset of kernel functions. This bound shows that for a fixed number of kernel functions, it is optimal to first include functions corresponding to rare class samples. We also discuss the connection of a subset kernel representation with the Nystrom method for a general class of regularized loss minimization methods. Computationally, we illustrate that the rare class representation produces statistically equivalent test error results on highly unbalanced datasets compared to using the full kernel representation, but with significantly better time and space complexity. Finally, we extend the method to rare class ordinal ranking, and apply it to a recent public competition problem in health informatics. 
The second problem studied in the thesis is known in the literature as the feature selection problem. Embedding feature selection in kernel classification leads to a non-convex optimization problem. We specify a primal formulation and solve the problem using a second-order trust region algorithm. To improve efficiency, we use the two-block Gauss-Seidel method, breaking the problem into a convex support vector machine subproblem and a non-convex feature selection subproblem. We reduce the possibility of saddle point convergence and improve solution quality by sharing an explicit functional margin variable between block iterates. We illustrate how our algorithm improves upon state-of-the-art methods.
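The rare-class part of this abstract optimizes the area under the ROC curve (AUC). As a reminder of what that quantity measures, the sketch below computes the empirical AUC as the fraction of (positive, negative) pairs that a scorer ranks correctly. The function name `auc_pairwise` and the 0/1 label encoding are assumptions made for illustration; this is an evaluation helper, not the optimization method of the thesis.

```python
def auc_pairwise(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    Labels are assumed to be 1 (rare/positive class) or 0 (negative class).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


if __name__ == "__main__":
    # Three of the four positive/negative pairs are ordered correctly.
    print(auc_pairwise([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Because the statistic depends only on how positives rank against negatives, it is unaffected by the class ratio, which is one reason it is a common objective for highly unbalanced data.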

    Semantic concept detection in imbalanced datasets based on different under-sampling strategies

    Semantic concept detection is a very useful technique for developing powerful retrieval or filtering systems for multimedia data. To date, the methods for concept detection have been converging on generic classification schemes. However, classification algorithms often face imbalanced datasets or rare-class problems, which degrade the performance of many classifiers. In this paper, we adopt three “under-sampling” strategies to handle the imbalanced dataset issue in an SVM classification framework and evaluate their performance on the TRECVid 2007 dataset and additional positive samples from the TRECVid 2010 development set. Experimental results show that our well-designed “under-sampling” methods (method SAK) increase the performance of concept detection by about 9.6% overall. In cases of extreme imbalance in the collection, the proposed methods perform worse than a baseline sampling method (method SI); in the majority of cases, however, they increase the performance of concept detection substantially. We also conclude that method SAK is a promising solution for SVM classification on datasets that are not extremely imbalanced.
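The abstract does not describe how its three under-sampling strategies (including SAK and SI) select samples, so the sketch below shows only the generic idea: plain random under-sampling of the majority class before training a classifier. The function name `undersample_majority` and its `ratio` and `seed` parameters are hypothetical and are not taken from the paper.

```python
import random


def undersample_majority(samples, labels, majority_label, ratio=1.0, seed=0):
    """Randomly drop majority-class samples until the classes are roughly balanced.

    ratio is the desired number of majority samples per minority sample.
    This is generic random under-sampling, not the SAK or SI method.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]

    # Never ask for more majority samples than are available.
    keep_n = min(len(majority), int(round(ratio * len(minority))))
    kept = rng.sample(majority, keep_n)

    # Preserve the original ordering of the retained samples.
    idx = sorted(minority + kept)
    return [samples[i] for i in idx], [labels[i] for i in idx]


if __name__ == "__main__":
    X = list(range(100))
    y = [1] * 5 + [0] * 95  # 5 positives, 95 negatives
    X_bal, y_bal = undersample_majority(X, y, majority_label=0)
    print(len(y_bal), y_bal.count(1), y_bal.count(0))
```

The reduced set would then be passed to the SVM in place of the full training set. Random selection discards information indiscriminately, which is the kind of weakness that more carefully designed under-sampling strategies aim to address.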

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: it is noisy, has missing entries, and is imbalanced in the classes of interest, all of which lead to serious bias in predictive modeling. Since standard data mining methods often perform poorly, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data from health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
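A cost-sensitive SVM penalizes errors on the rare class more heavily than errors on the common class. The abstract does not say how the per-class costs are chosen, so the sketch below uses one common heuristic as an assumption: weights inversely proportional to class frequency. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    weight(c) = n_samples / (n_classes * count(c)), so a perfectly balanced
    dataset gives every class a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


if __name__ == "__main__":
    y = [0] * 90 + [1] * 10  # 90 common-class and 10 rare-class records
    weights = balanced_class_weights(y)
    print(weights)
```

In a weighted SVM, each weight would multiply the misclassification penalty for samples of that class, so the rare class here would be penalized nine times as heavily as the common class.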