741 research outputs found

    "Virus hunting" using radial distance weighted discrimination

    Motivated by the challenge of using DNA-seq data to identify viruses in human blood samples, we propose a novel classification algorithm called "Radial Distance Weighted Discrimination" (or Radial DWD). This classifier is designed for binary classification, assuming one class is surrounded by the other class in very diverse radial directions, which is seen to be typical for our virus detection data. This separation of the two classes in multiple radial directions naturally motivates the development of Radial DWD. While classical machine learning methods such as the Support Vector Machine and linear Distance Weighted Discrimination can sometimes give reasonable answers for a given data set, their generalizability is severely compromised because of the linear separating boundary. Radial DWD addresses this challenge by using a spherical separating boundary, which is more appropriate in this particular case. Simulations show that for appropriate radial contexts, this gives much better generalizability than linear methods, and also much better than conventional kernel-based (nonlinear) Support Vector Machines, because the latter methods essentially use much of the information in the data for determining the shape of the separating boundary. The effectiveness of Radial DWD is demonstrated for real virus detection. Comment: Published at http://dx.doi.org/10.1214/15-AOAS869 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
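The idea of a spherical separating boundary described in this abstract can be illustrated with a minimal sketch. It is not the Radial DWD optimization from the paper: the center is simply assumed to be the mean of the surrounded class, and the radius is assumed to lie midway between the farthest inner point and the nearest outer point. The names `fit_sphere` and `is_inner` are hypothetical.

```python
import math


def fit_sphere(inner, outer):
    """Fit a crude spherical boundary around the surrounded ("inner") class.

    Assumption: center = mean of the inner class; radius = midpoint between
    the largest inner distance and the smallest outer distance to that center.
    """
    dim = len(inner[0])
    center = [sum(p[j] for p in inner) / len(inner) for j in range(dim)]
    r_inner = max(math.dist(p, center) for p in inner)
    r_outer = min(math.dist(p, center) for p in outer)
    return center, (r_inner + r_outer) / 2.0


def is_inner(point, center, radius):
    """Classify a point as inner class if it falls inside the sphere."""
    return math.dist(point, center) <= radius


# Toy data: a tight inner cluster surrounded by outer points in several directions.
inner = [(0.0, 0.0), (0.1, 0.0), (0.0, -0.1)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]
center, radius = fit_sphere(inner, outer)
```

A new point in a radial direction not represented in the training data, such as `(3.0, 3.0)`, still falls outside the sphere, which is the kind of generalization a single linear boundary cannot provide.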

    GenSVM: a generalized multiclass support vector machine

    Traditional extensions of the binary support vector machine (SVM) to multiclass problems are either heuristics or require solving a large dual optimization problem. Here, a generalized multiclass SVM called GenSVM is proposed. In this method, classification boundaries for a K-class problem are constructed in a (K - 1)-dimensional space using a simplex encoding. Additionally, several different weightings of the misclassification errors are incorporated in the loss function, such that it generalizes three existing multiclass SVMs through a single optimization problem. An iterative majorization algorithm is derived that solves the optimization problem without the need for a dual formulation. This algorithm has the advantage that it can use warm starts during cross validation and during a grid search, which significantly speeds up the training phase. Rigorous numerical experiments compare linear GenSVM with seven existing multiclass SVMs on both small and large data sets. These comparisons show that the proposed method is competitive with existing methods in both predictive accuracy and training time, and that it significantly outperforms several existing methods on these criteria.
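The simplex encoding mentioned in this abstract places K class targets at the vertices of a regular simplex in (K - 1) dimensions. The sketch below shows one generic way to build such vertices and to assign a projected point to the nearest vertex. It is an illustration under assumptions, not the exact coordinates or the loss function used by GenSVM, and `simplex_vertices` and `nearest_vertex` are hypothetical names.

```python
import math


def simplex_vertices(k):
    """Return k points in R^(k-1), pairwise at distance 1 and centered at 0."""
    if k < 2:
        raise ValueError("need at least two classes")
    n = k - 1
    s = 1.0 / math.sqrt(2.0)
    # The first n vertices are scaled unit vectors: pairwise distance is 1.
    verts = [[s if j == i else 0.0 for j in range(n)] for i in range(n)]
    # The last vertex lies on the diagonal, at distance 1 from all the others.
    a = (1.0 + math.sqrt(n + 1.0)) / (n * math.sqrt(2.0))
    verts.append([a] * n)
    # Shift so that the centroid sits at the origin.
    centroid = [sum(v[j] for v in verts) / k for j in range(n)]
    return [[v[j] - centroid[j] for j in range(n)] for v in verts]


def nearest_vertex(point, verts):
    """Predict the class whose simplex vertex is closest to the point."""
    return min(range(len(verts)), key=lambda i: math.dist(point, verts[i]))


V = simplex_vertices(4)  # 4 classes encoded in 3 dimensions
```

Because every pair of vertices is equally far apart, no class is treated as closer to another by construction, which is the usual motivation for this kind of encoding.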

    On Reject and Refine Options in Multicategory Classification

    In many real applications of statistical learning, a decision made from misclassification can be too costly to afford; in this case, a reject option, which defers the decision until further investigation is conducted, is often preferred. In recent years, there has been much development for binary classification with a reject option. Yet, little progress has been made for the multicategory case. In this article, we propose margin-based multicategory classification methods with a reject option. In addition, and more importantly, we introduce a new and unique refine option for the multicategory problem, where the class of an observation is predicted to be from a set of class labels, whose cardinality is not necessarily one. The main advantage of both options lies in their capacity of identifying error-prone observations. Moreover, the refine option can provide more constructive information for classification by effectively ruling out implausible classes. Efficient implementations have been developed for the proposed methods. On the theoretical side, we offer a novel statistical learning theory and show a fast convergence rate of the excess $\ell$-risk of our methods with emphasis on diverging dimensionality and number of classes. The results can be further improved under a low noise assumption. A set of comprehensive simulation and real data studies has shown the usefulness of the new learning tools compared to regular multicategory classifiers. Detailed proofs of theorems and extended numerical results are included in the supplemental materials available online. Comment: A revised version of this paper was accepted for publication in the Journal of the American Statistical Association Theory and Methods Section. 52 pages, 6 figures
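The difference between predicting, refining, and rejecting can be made concrete with a small sketch. It assumes class probability estimates are already available and uses two hypothetical thresholds, `confident` and `implausible`; the paper's methods are margin-based, so this is only an illustration of the decision types, not of the proposed classifiers.

```python
def decide(probs, confident=0.8, implausible=0.1):
    """Return (decision, labels) from a dict of class -> estimated probability.

    "predict": a single label is returned.
    "refine":  a proper subset of labels is returned (implausible ones ruled out).
    "reject":  no class can be ruled out, so the decision is deferred.
    """
    top = max(probs, key=probs.get)
    if probs[top] >= confident:
        return "predict", {top}
    # Keep every class that is not clearly implausible; always keep the top one.
    plausible = {c for c, p in probs.items() if p > implausible} | {top}
    if len(plausible) == 1:
        return "predict", plausible
    if len(plausible) == len(probs):
        return "reject", plausible
    return "refine", plausible


clear = decide({"a": 0.90, "b": 0.05, "c": 0.05})
narrowed = decide({"a": 0.50, "b": 0.45, "c": 0.05})
deferred = decide({"a": 0.40, "b": 0.30, "c": 0.30})
```

In the second call the observation is error-prone between `a` and `b`, yet class `c` is still ruled out, which is the extra information a refine option provides over a plain reject.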

    An Ensemble Framework Approach to Crop Type Prediction Using Feature Selection and Multiclass Classification

    Crop type classification plays a crucial role in modern agriculture, aiding in yield prediction, resource management, and land-use planning. This paper presents a comprehensive framework for crop type classification that combines feature selection techniques, a robust classification algorithm, and a Support Vector Machine (SVM)-based multiclass classification approach. The proposed framework begins with a novel feature selection process that identifies the most relevant attributes from the agricultural and rainfall data. This feature selection step is essential for reducing data dimensionality, enhancing classification accuracy, and improving model interpretability. Following feature selection, a state-of-the-art multiclass classification strategy based on Support Vector Machines is employed. SVMs are known for their capability to handle high-dimensional data and have demonstrated superior performance in various classification tasks. In this framework, SVMs are adapted to handle multiclass crop type classification efficiently. The model is trained on the selected features and optimized using hyperparameter tuning techniques to ensure robust performance.

    Modelling, Monitoring, Control and Optimization for Complex Industrial Processes

    This reprint includes 22 research papers and an editorial, collected from the Special Issue "Modelling, Monitoring, Control and Optimization for Complex Industrial Processes", highlighting recent research advances and emerging research directions in complex industrial processes. This reprint aims to promote the research field and to benefit readers from both academic communities and industrial sectors.

    Gibbs Max-margin Topic Models with Data Augmentation

    Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier to discover hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restricting assumptions and no need to solve SVM subproblems. Furthermore, each step of the "augment-and-collapse" Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results demonstrate significant improvements in time efficiency. The classification performance is also significantly improved over competitors on binary, multi-class and multi-label classification tasks. Comment: 35 pages

    Semi-automated image analysis for the identification of bivalve larvae from a Cape Cod estuary

    Author Posting. © Association for the Sciences of Limnology and Oceanography, 2012. This article is posted here by permission of Association for the Sciences of Limnology and Oceanography for personal use, not for redistribution. The definitive version was published in Limnology and Oceanography: Methods 10 (2012): 538-554, doi:10.4319/lom.2012.10.538. Machine-learning methods for identifying planktonic organisms are becoming well-established. Although similar morphologies among species make traditional image identification methods difficult for larval bivalves, species-specific shell birefringence patterns under polarized light permit identification by color and texture-based features. This approach uses cross-polarized images of bivalve larvae, extracts Gabor and color angle features from each image, and classifies images using a Support Vector Machine. We adapted this method, which was established on hatchery-reared larvae, to identify bivalve larvae from a series of field samples from a Cape Cod estuary in 2009. This method had 98% identification accuracy for four hatchery-reared species. We used a multiplex polymerase chain reaction (PCR) method to confirm field identifications and to compare accuracies to the software classifications. Image classification of larvae collected in the field had lower accuracies than both the classification of hatchery species and PCR-based identification due to error in visually classifying unknown larvae and variability in larval images from the field. A six-species field training set had the best correspondence to our visual classifications with 75% overall agreement and individual species agreements from 63% to 88%. Larval abundance estimates for a time-series of field samples showed good correspondence with visual methods after correction. Overall, this approach represents a cost- and time-saving alternative to molecular-based identifications and can produce sufficient results to address long-term abundance and transport-based questions on a species-specific level, a rarity in studies of bivalve larvae. This project was supported by an award to S. Gallager and C. Mingione Thompson from the Estuarine Reserves Division, Office of Ocean and Coastal Resource Management, National Ocean Service, National Oceanic and Atmospheric Administration and a grant from Woods Hole Oceanographic Institution’s Coastal Ocean Institute.